Preface



This proceedings volume includes papers accepted for presentation at the 4th 
International Workshop on Visual Form (IWVF4), held in Capri, Italy, 28-30 
May 2001. IWVF4 was sponsored by the International Association for Pattern 
Recognition (IAPR), and organized by the Department of Computer Science
and Systems of the University of Naples “Federico II” and the Institute of
Cybernetics of the National Research Council of Italy, Arco Felice (Naples). The
three previous IWVF were held in Capri in 1991, 1994, and 1997, organized by 
the same institutions. 

IWVF4 attracted 117 research contributions from academic and research in- 
stitutions in 26 different countries. The contributions focus on theoretical and 
applicative aspects of visual form processing such as shape representation, analy- 
sis, recognition, modeling, and retrieval. The reviewing process, accomplished by 
an international board of reviewers, listed separately, led to a technical program 
including 66 contributions. These papers cover important topics and constitute 
a collection of recent results achieved by leading research groups from several 
countries. Among the 66 accepted papers, 19 were selected for oral presentation 
and 47 for poster presentation. All accepted contributions have been scheduled in 
plenary sessions to favor as much as possible the interaction among participants. 
The program was completed by seven invited lectures, presented by internatio- 
nally well known speakers: Alfred Bruckstein (Technion, Israel), Horst Bunke 
(University of Bern, Switzerland), Terry Caelli (University of Alberta, Canada), 
Sven Dickinson (University of Toronto, Canada), Donald Hoffman (University 
of California, Irvine, USA), Josef Kittler (University of Surrey, UK), and
Shimon Ullman (The Weizmann Institute of Science, Israel). A panel on the topic
State of the Art and Prospects of Research on Shape at the Dawn of the Third 
Millennium has also been scheduled to conclude IWVF4. 

IWVF4 and this proceedings volume would not have been possible without 
the financial support of the universities, research institutions, and other orga- 
nizations that contributed generously. We also wish to thank contributors, who 
responded to the call for papers in a very positive manner, invited speakers, all 
reviewers and members of the Scientific and Local Committees, as well as all 
IWVF4 participants, for their scientific contribution and enthusiasm. 
Invariant Recognition and Processing of Planar Shapes



Alfred M. Bruckstein 

Ollendorff Professor of Science 
Computer Science Department, 
Technion - IIT, Haifa 32000 Israel 
freddy@cs.technion.ac.il



Abstract. This short paper surveys methods for planar shape recogni- 
tion and shape smoothing and processing invariant under viewing dis- 
tortions and possibly partial occlusions. It is argued that all the results 
available on these problems implicitly follow from considering two basic 
topics: invariant location of points with respect to a given shape (i.e. 
a given collection of points) and invariant displacement of points with 
regard to the given shape. 



1 Introduction 

Vision is a complex process aimed at extracting useful information from images: 
the tasks of recognizing three-dimensional shapes from their two-dimensional 
projections, of evaluating distances and depths and spatial relationships between 
objects are tantamount to what we mean by seeing. In spite of promises, in the 
early 60’s, that within a decade computers would be able “to see”, we are not even
close today to having machines that can recognize objects in images the way
even the youngest of children are capable of doing. As a technological challenge,
the process of vision has taught us a lesson in modesty: we are indeed quite
limited in what we can accomplish in this domain, even via deep mathematical
results and the deployment of ever-more-powerful electronic computing devices. In
order to address some practical technological image analysis questions and in 
order to appreciate the complexity of the issues involved in “seeing” it helps to 
consider simplified vision problems such as “character recognition” and other 
“model-based planar shape recognition” problems and see how far our theories 
(“brain-power”) and experiments (“number-crunching power”) can take us to- 
ward working systems that accomplish useful image analysis tasks. As a result of 
such efforts we do have a few vision systems that work and there is a vast
literature in the “hot” field of computer vision dealing with representation,
approximation, completion, enhancement, smoothing, exaggeration/characterization and
recognition of planar shapes. This paper surveys methods for planar shape recog- 
nition and processing (smoothing, enhancement, exaggeration etc.) invariant un- 
der distortions that occur when looking at the planar shapes from various points 
of view. These distortions are Euclidean, Similarity, Affine and Projective maps 
of the plane to itself and model the possible viewing projections of the plane
where a shape resides, into the image plane of a pinhole camera, capturing the 
shape from arbitrary locations. A further problem one must often deal with when 
looking at shapes is occlusion. If several planar shapes are superimposed in the 
plane or are floating in 3D-space they can and will (fully, or partially) occlude 
each other. Under full occlusion there is of course no hope for recognition, but 
how about partial occlusion? Can we recognize a planar shape from a partial 
glimpse of its contour? Is there enough information in a portion of the projection 
of a planar shape to enable its recognition? We shall here address such questions 
too. The main goal of this paper will be to point out that all the proposed 
methods to address the above mentioned topics implicitly require the solution 
of the following two basic problems: distortion-invariant location of points with 
respect to a given planar shape (which for our purposes can be a planar region
with curved or polygonal boundaries or in fact an arbitrary set of points) and 
invariant displacement, motion or relocation of points with respect to the given 
shape. 



2 Invariant Point Locations and Displacements 



A planar shape $S$, for our purpose, will be a set of points in $\mathbb{R}^2$ that usually
specify a connected planar region with a boundary that is either smooth or
polygonal. The viewing distortions are classes of transformations
$V_\phi : \mathbb{R}^2 \to \mathbb{R}^2$ parameterized by a set of values $\phi$, and while the class of
transformations is assumed to be known to us, the exact values of the
parameters are not. The
classes of transformations considered are continuous groups of transformations 
modeling various imaging modalities, the important examples being: 



- The Euclidean motions (parameterized by a rotation angle and a
two-dimensional translation vector, i.e. $\phi$ has 3 parameters).

- Similarity transformations (Euclidean motions complemented by uniform
scaling transformations, i.e. $|\phi| = 4$ parameters).

- Equi-Affine and Affine Mappings (parameterized by a 2 x 2 matrix - 4
parameters, or 3 if the matrix has determinant 1 - and a translation vector,
i.e. $|\phi| = 6$ or 5 parameters).

- Projective Transformations (modeling the perspective projection with
$|\phi| = 8$ parameters).



Given a planar shape $S \subset \mathbb{R}^2$ and a class of viewing distortions
$V_\phi : \mathbb{R}^2 \to \mathbb{R}^2$ we consider the following problem:
Two observers A and B look at $S_A = V_{\phi_A}(S)$ and at $S_B = V_{\phi_B}(S)$
respectively without knowing $\phi_A$ and $\phi_B$. In other words A and B look
at $S$ from different points of view and the details of their camera location,
orientation and settings are unknown to them. Observer A chooses a
point $P_A$ in its image plane $\mathbb{R}^2$, and wants to describe its location w.r.t.
$V_{\phi_A}(S)$ to observer B, in order to enable him to locate the corresponding
point $P_B = V_{\phi_B}(V_{\phi_A}^{-1}(P_A))$. A knows that B looks at
$S_B = V_{\phi_B}(S) = V_{\phi_B}(V_{\phi_A}^{-1}(S_A))$ but this is all the information available to A and B. How
should A describe the location of $P_A$ w.r.t. $S_A$ to B?
Solving this problem raises the issue of characterizing a position ($P_A$) in
the plane of $S_A$ in a way that is invariant to the class of transformations
$V_\phi$.



To give a very simple example: Let $S$ be a set of indistinguishable
points in the plane $\{P_1, P_2, \ldots, P_N\}$ and $V_\phi$ be the class of
Euclidean motions. A new point $P$ should be described to observers of
$V_\phi\{P_1, \ldots, P_N\} = \{V_\phi(P_1), V_\phi(P_2), \ldots, V_\phi(P_N)\}$ so that they will be able
to locate $V_\phi(P)$ in their “images”. How should we do this? Well, we shall
have to describe $P$’s location w.r.t. $\{P_1, P_2, \ldots, P_N\}$ in a Euclidean-invariant
way. We know that Euclidean motions preserve length and angles between line
segments so there are several ways to provide invariant coordinates in the plane
w.r.t. the shape $S$. The origin of an invariant coordinate system could be the
Euclidean-invariant (in fact even Affine-invariant) centroid of the points of $S$, i.e.
$O_S = \frac{1}{N} \sum_{i=1}^{N} P_i$. As one of the axes (say the x-axis) of the “shape-adapted
invariant” coordinate system, one may choose the longest or shortest (or closest
in length to the “average” length) vector among $\{O_S P_i\}$ for $i = 1, 2, \ldots, N$.

This being settled, all one has to do is to specify $P$ in this adapted and
Euclidean-invariant coordinate system with origin at $O_S$ and orthogonal axes
with the x-axis chosen as described above. Note that other solutions are
possible. We here assumed that the points of $S$ are indistinguishable, but
otherwise the problem would be much simpler. Note also that ambiguous
situations can and do arise. In case all the points of $S$ form a regular $N$-gon,
there are $N$ equal length vectors $\{O_S P_i\}$, $i = 1, 2, \ldots, N$, and we cannot specify
uniquely an x-axis. But, a moment of thought will reveal that in this case the
location of any point in the plane is inherently ambiguous up to rotations of $2\pi/N$.
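
The construction just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`centroid`, `invariant_coordinates`, `euclidean_motion`) and the sample points are hypothetical, the x-axis is taken along the longest centroid-to-point vector, and that longest vector is assumed to be unique (so the regular N-gon ambiguity does not occur).

```python
import math

def centroid(points):
    """Centroid O_S of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def invariant_coordinates(shape, p):
    """Coordinates of p in a frame adapted to `shape`.

    Origin: centroid of the shape. x-axis: direction of the longest
    centroid-to-point vector (assumed unique in this sketch).
    """
    ox, oy = centroid(shape)
    ax, ay = max(((x - ox, y - oy) for x, y in shape),
                 key=lambda v: math.hypot(*v))
    norm = math.hypot(ax, ay)
    ux, uy = ax / norm, ay / norm   # unit x-axis
    vx, vy = -uy, ux                # unit y-axis, 90 degrees counter-clockwise
    dx, dy = p[0] - ox, p[1] - oy
    return (dx * ux + dy * uy, dx * vx + dy * vy)

def euclidean_motion(point, angle, shift):
    """Rotate `point` by `angle` about the origin, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = point
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

# Observer A sees the shape and the point as given; observer B sees both
# after an unknown Euclidean motion.
shape = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0), (0.5, -1.0)]
p = (2.0, 1.5)
moved_shape = [euclidean_motion(q, 0.7, (3.0, -2.0)) for q in shape]
moved_p = euclidean_motion(p, 0.7, (3.0, -2.0))

coords_a = invariant_coordinates(shape, p)
coords_b = invariant_coordinates(moved_shape, moved_p)
```

Both observers obtain the same pair of numbers (up to rounding), so A can communicate the location of the point to B without either of them knowing the motion parameters.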



Contemplating the above-presented simple example one realizes that solving 
the problem of invariant point location is heavily based on the invariants of 
the continuous group of transformations $V_\phi$. The centroid of $S$, $O_S$, an invariant
under $V_\phi$, enabled the description of $P$ using a distance $d(O_S, P)$, the length of
the vector $O_S P$ (again a $V_\phi$-invariant), up to a further parameter that locates $P$
on the circle centered at $O_S$ with radius $d(O_S, P)$, and then the “variability” or
inherent “richness” of the geometry of $S$ enables the reduction of the remaining
ambiguity.
Suppose next that we want not only to locate points in ways that are invariant
under $V_\phi$ but we also want to perform invariant motions. This problem is already
completely solved in the above presented example, once an “$S$-shape-adapted”
coordinate system becomes available. Any motion can be defined with respect to
this coordinate system and hence invariantly reproduced by all viewers of $S$. In
fact, when we establish an adapted frame of reference we implicitly determine
the transformation parameters, $\phi$, and effectively undo the action of $V_\phi$.

To complicate the matters further consider the possibility that the shape S 
will be partially occluded in some of its views. Can we, in this case, establish the 
location of P invariantly and perform some invariant motions as before? Clearly, 
in the example when S is a point constellation made of N indistinguishable 
points, if we assume that occlusion can remove arbitrarily some of the points, 
the situation becomes rather hopeless. However, if the occlusion is restricted 
to wiping out only points covered by a disk of radius limited to some R, or 
alternatively, we can assume that we shall always see all points within a certain 
radius around an (unknown) center point in the plane, the prospects of being able 
to solve the problem, at least in certain lucky instances, are much better. Indeed, 
returning to our simple example, assume that we have many indistinguishable 
landmark points (forming a “reference” shape S in the plane), and that a mobile 
robot navigates in the plane, and has a radius of sensing or visibility of R. At each 
location P of the robot in the plane it will see all points of S whose distance from 
P is less than R, up to its own arbitrary rotation. Hence, the question of being 
able to specify P from this data becomes the problem of robotic self location in
this context. So given a reference map (showing the “landmark” points of S in 
some “absolute” coordinate system), we want the robot to be able to determine 
its location on this map from what it sees (i.e. a portion of the points of S 
translated by P and seen in an arbitrary rotated coordinate system). To locate 
itself the robot can (and should) do the following: 

Using the arbitrarily rotated constellation of points of $S$ within its radius of
sensing, i.e. $S(P,R) = \{\Omega_\theta(P_i - P) \mid P_i \in S,\ d(P_i, P) < R\}$ where $\Omega_\theta$ is a
2x2 rotation matrix about $P$, “search” in $S$ for a similar constellation by checking
various center points (2 parameters: $x_P, y_P$) and rotations (1 parameter: $\theta$).

As stated above, this solution involves a horrendous 3-dimensional search
and it must be avoided by using various tricks like signatures and (geometric)
hashing based on “distances” from $P$ to $\Omega_\theta(P_i - P)$ and distances between the
$P_i$’s seen from $P$. This leads to more efficient Hough-Transform like solutions
for the self location problem.
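
To make the role of a rotation-invariant signature concrete, here is a deliberately naive sketch, not the geometric hashing or Hough-like schemes alluded to above. It assumes the robot measures the distances to all landmarks within its sensing radius; the sorted list of these distances does not depend on the robot's rotation, so the search reduces to the two position parameters. The names `visible_distances` and `locate`, the landmark map, and the grid of candidate positions are all hypothetical.

```python
import math

def visible_distances(landmarks, position, radius):
    """Sorted distances from `position` to the landmarks closer than `radius`.

    The result is unchanged by any rotation of the robot's own frame.
    """
    px, py = position
    dists = (math.hypot(x - px, y - py) for x, y in landmarks)
    return sorted(d for d in dists if d < radius)

def locate(landmarks, observed, radius, candidates, tol=1e-6):
    """Candidate positions whose predicted signature matches `observed`."""
    matches = []
    for c in candidates:
        predicted = visible_distances(landmarks, c, radius)
        if len(predicted) == len(observed) and all(
            abs(a - b) <= tol for a, b in zip(predicted, observed)
        ):
            matches.append(c)
    return matches

# Reference map of indistinguishable landmarks (the shape S).
landmarks = [(0, 0), (5, 1), (2, 4), (7, 6), (1, 8), (9, 2)]
radius = 5.5
true_position = (3, 3)

# What the robot measures at its (unknown) location.
observed = visible_distances(landmarks, true_position, radius)

# Exhaustive 2-parameter search over a coarse grid of candidate positions.
grid = [(x, y) for x in range(10) for y in range(10)]
found = locate(landmarks, observed, radius, grid)
```

With this particular map and radius the search singles out the true position. With fewer visible landmarks the distance signature alone can be ambiguous (for example, between mirror-image positions), which is one reason richer signatures, such as the distances between the visible points themselves, are useful.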

It would help if the points of S would be ordered on a curve, say a polygonal 
boundary of a planar region, or would be discrete landmarks on a continuous 
but visible boundary curve in the plane. Fortunately for those addressing shape 
analysis problems this is most often the case. 
3 Invariant Boundary Signatures for Recognition under
Partial Occlusions 

If the shape $S$ is a region of $\mathbb{R}^2$ with a boundary curve $\partial S = C$ that is either
smooth or polygonal, we shall have to address the problem of recognizing the
shape $S$ from $V_\phi$-distorted portions of its boundary. Our claim is that if we
can effectively solve the problem of locating a point $P$ on the curve $C$ in a
$V_\phi$-invariant way based on the local behavior of $C$ in a neighborhood of $P$, then we
shall have a way to detect the possible presence of the shape $S$ from a portion
of its boundary. How can we locate $P$ on $C$ in $V_\phi$-invariant ways? We shall have
to associate to $P$ a set of numbers (“co-ordinates” or “signatures”) that are
invariant under the class of $V_\phi$-transformations. To do so, one again has to rely
on known geometric invariants of the group of viewing transformations assumed
to act on $S$ to produce its image. The fact that we live on a curve $C$ makes life
a bit easier.

As an example, consider the case where $C$ is a polygonal curve and $V_\phi$ is the
group of Affine-transformations. Since all the viewing transformations map lines
into lines and hence the vertices of the poly-line $C$ into vertices of a transformed
poly-line $V_\phi(C)$ we can define the local neighborhood of each vertex $C(i)$ of $C$,
as the “ordered” constellation of $2n+1$ points
$\{C(i-n), \ldots, C(i-1), C(i), C(i+1), \ldots, C(i+n)\}$ and associate to $C(i)$ invariants of $V_\phi$ based on this constellation
of points. Affine transformations are known to scale areas by the determinant of
their associated 2x2 matrix of “shear and scale” parameters, hence we know
that ratios of corresponding areas will be affine invariant. Therefore we could
consider the areas of the triangles $\Delta_1 = [C(i-1)C(i)C(i+1)]$,
$\Delta_2 = [C(i-2)C(i)C(i+2)]$, $\ldots$, $\Delta_n = [C(i-n)C(i)C(i+n)]$ and associate to $C(i)$ a vector
of ratios of the type $\{\Delta_k/\Delta_l \mid k, l \in \{1, 2, \ldots, n\},\ k \neq l\}$. This vector will be
invariant under the affine group of viewing transformations and will (hopefully)
uniquely characterize the point $C(i)$ in an affine-invariant way.
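
The triangle-area signature above can be checked numerically. The sketch below is illustrative only: the function names and the sample polygon are hypothetical, the polygon is treated as closed so that indices wrap around, only the ratios to the first triangle area are kept (a subset of the full set of ratios), and the first triangle is assumed to be non-degenerate.

```python
def triangle_area(a, b, c):
    """Signed area of the triangle with vertices a, b, c."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def affine_signature(polygon, i, n):
    """Ratios of triangle areas at vertex i of a closed polygon.

    The k-th triangle has vertices C(i-k), C(i), C(i+k); the signature
    lists its area divided by the area of the first triangle, k = 2..n.
    """
    m = len(polygon)
    areas = [
        triangle_area(polygon[(i - k) % m], polygon[i], polygon[(i + k) % m])
        for k in range(1, n + 1)
    ]
    return [a / areas[0] for a in areas[1:]]

def affine_map(p, matrix, shift):
    """Apply the affine map x -> matrix * x + shift to the point p."""
    (a, b), (c, d) = matrix
    return (a * p[0] + b * p[1] + shift[0], c * p[0] + d * p[1] + shift[1])

poly = [(0, 0), (3, 0), (5, 2), (4, 5), (1, 6), (-1, 3)]
sig = affine_signature(poly, 2, 2)

# The same polygon seen through an arbitrary invertible affine map.
mapped = [affine_map(p, ((2.0, 1.0), (0.5, 3.0)), (7.0, -4.0)) for p in poly]
sig_mapped = affine_signature(mapped, 2, 2)
```

Every signed area is multiplied by the same determinant, so the two signatures agree up to rounding. Using signed rather than absolute areas also keeps the ratios meaningful under orientation-reversing maps.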

The ideas outlined so far provide us with a procedure for invariantly
characterizing the vertices of a poly-line. However, we can use similar ideas to also locate
intermediate points situated on the line segments connecting them. Note that
the number $n$ in the example above is a locality parameter: smaller $n$’s imply
more local characterization in terms of the size of neighborhoods on the curve
$C$. Contemplating the foregoing example we may ask how to adapt this method
to smooth curves where there are no vertices to enable us to count “landmark”
points to the left and to the right of the chosen vertex in view-invariant ways.
There is a beautiful body of mathematical work on invariant differential
geometry providing differential invariants associated to smooth curves and surfaces,
work that essentially carried out Klein’s Erlangen program for differential
geometry, and is reported on in books and papers that appeared about 100 years ago.
The differential invariants enable one to determine a $V_\phi$-invariant metric, i.e. a
way to measure “length” on the curve $C$ invariant with respect to the viewing
distortion, similar to the way one finds, rather easily, the Euclidean-invariant
arclength on smooth curves. If we have an invariant metric, we claim that our
problem of invariant point characterizations on $C$ can be readily put in the same
framework as in the example of a poly-line. Indeed we can now use the invariant
metric to locate to the left and right of $P$ on $C$ - if we define $P = C(0)$, and
describe $C$ as $C(\mu)$ where $\mu$ is the invariant metric parameterization of $C$ about
$C(0) = P$ - the points $\{C(0 - n\Delta), \ldots, C(0 - \Delta), C(0), C(0 + \Delta), \ldots, C(0 + n\Delta)\}$,
and these $2n+1$ points will form an invariant constellation of landmarks about
$P = C(0)$. Here $\Delta$ is arbitrarily chosen as a small “invariant” distance in terms
of the invariant metric. It is very beautiful to see that letting $\Delta \searrow 0$ one often
recovers, from the global invariant quantities that were defined on the
constellation of points about $C(0) = P$, differential invariant quantities that correspond
to known “generalized invariant curvatures” (that generalize the classical
curvature obtained if $V_\phi$ is the simplest, Euclidean viewing distortion). Therefore
to invariantly locate a point $P$ on $C$, we can use the existing invariant
metrics on $C$ (if $C$ is a polygon - the ordering of vertices is an invariant metric!)
to determine about $P$ an invariant constellation of “landmark” points on the
boundary curve and use global invariants of $V_\phi$ to associate to $P$ an
“invariant signature vector” $I_P(\Delta)$. If $\Delta \searrow 0$ this vector yields, for “good” choices of
invariant quantities, “generalized invariant curvatures” for the various viewing
groups of transformations $V_\phi$.

We however do not propose to let $\Delta \searrow 0$. $\Delta$ is a locality parameter (together
with $n$) and we could use several small but finite values for $\Delta$ to produce (what
we called) a “scale-space” of invariant signature vectors $\{I_P(\Delta_i)\}_{\Delta_i \in \mathrm{Range}}$.

This freedom allows us to associate to a curve $C(\mu)$, parameterized in terms
of its “invariant metric or arclength”, a vector valued scale space of signature
functions $\{I_P(\Delta_i, \mu)\}_{\Delta_i \in \mathrm{Range}}$, that will characterize it in a view-invariant way.
This characterization is local (its locality being in fact under our control via
$\Delta$ and $n$) and hence useful to recognize portions of boundaries in scenes where
planar shapes appear both distorted and partially occluded.

4 Invariant Smoothing and Processing of Shapes 

Smoothing and other processes of modifying and enhancing planar shapes
involve moving their points to new locations. In the spirit of the discussion above,
we want to do this in “viewing-distortion-invariant” ways. To do so we have to
locate, i.e. invariantly characterize, the points of a shape $S$ (or of its boundary
$C = \partial S$) and then invariantly move them to new locations in the plane. The
discussions of the previous sections showed us various ways to invariantly locate
points in the plane of $S$ (or on $S$). Moving points around is not much more
difficult. We shall have to associate to each point (of $S$, or in the plane of $S$) a vector
whose direction and length have been defined so as to take us to another point,
in a way that is $V_\phi$-invariant. In the example of $S$ being a constellation of points
with a robot using the points of $S$ to locate itself at $P$, we also want it to
determine a new place to go, i.e. to determine a point $P_{new}$, so as to have the property
that from $V_\phi(P)$ a robot using the points $\{V_\phi(P_1), \ldots, V_\phi(P_N)\}$ will be able to
both locate itself and move to $V_\phi(P_{new})$. Of course on shapes we shall have to do
motions that achieve certain goals like smoothing the shape or enhancing it in
desirable ways. To design view-distortion invariant motions, we can (and indeed
must) rely on invariant point characterizations. Suppose we are at a point $P$ on
the boundary $C = \partial S$ of a shape $S$, and we have established a constellation of
landmark points about $P$. We can use the invariant point constellation about $P$
to define a $V_\phi$-invariant motion from $P$ to $P_{new}$.

Let us consider again a simple example: if $V_\phi$ is the Affine group of viewing
transformations, the centroid of the point constellation about $P$ is an invariantly
defined candidate for $P_{new}$. Indeed it is an average of points around $P$ and the
process of moving $P$ to such a $P_{new}$ or, differentially, toward such a new position
can (relatively easily) be proved to provide an affine invariant shape smoothing
operation. If $S$ is a polygonal shape, i.e. $\partial S = C$ is a poly-line, then moving
the vertices according to such a smoothing operation can be shown to shrink
any shape into a polygonal ellipse, the affine image of a regular polygon with
the same number of vertices as the original shape. In fact ellipses and polygonal
ellipses are the results of many reasonably defined invariant averaging processes.
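
A minimal sketch of one such smoothing step on a closed polygon follows. It assumes that the constellation about each vertex consists of the vertex itself and its n neighbors on either side, and that the vertex is moved all the way to the centroid of that constellation; the function names and the sample polygon are hypothetical. The final lines compare "smooth, then view" with "view, then smooth" for an arbitrary affine map.

```python
def smooth_step(polygon, n=1):
    """Move every vertex of a closed polygon to the centroid of the
    2n+1 vertices centered on it (indices wrap around)."""
    m = len(polygon)
    smoothed = []
    for i in range(m):
        nbrs = [polygon[(i + k) % m] for k in range(-n, n + 1)]
        smoothed.append((sum(x for x, _ in nbrs) / len(nbrs),
                         sum(y for _, y in nbrs) / len(nbrs)))
    return smoothed

def affine_map(p, matrix, shift):
    """Apply the affine map x -> matrix * x + shift to the point p."""
    (a, b), (c, d) = matrix
    return (a * p[0] + b * p[1] + shift[0], c * p[0] + d * p[1] + shift[1])

poly = [(0.0, 0.0), (3.0, 0.0), (5.0, 2.0), (4.0, 5.0), (1.0, 6.0), (-1.0, 3.0)]
matrix, shift = ((2.0, 1.0), (0.5, 3.0)), (7.0, -4.0)

# Smooth first, then apply the viewing transformation ...
smooth_then_map = [affine_map(p, matrix, shift) for p in smooth_step(poly)]
# ... or apply the viewing transformation first, then smooth.
map_then_smooth = smooth_step([affine_map(p, matrix, shift) for p in poly])
```

The two results coincide up to rounding because an affine map sends the centroid of a point set to the centroid of the mapped set, which is exactly the invariance property required of the motion from $P$ to $P_{new}$.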

5 Concluding Remarks 

The main point of this paper is the thesis that in doing “practical” view-point 
invariant shape recognition or shape processing for smoothing or enhancement, 
one has to rely on the interplay between global and local (or even differential) 
invariants of the group of viewing transformations. 

Invariant reparameterization of curves based on “adapted metrics” enables 
us to design generalized local (not necessarily differential) signatures for 
partially occluded recognition. These signatures have many incarnations - they 
can be scalars, vectors or even a scale-space of values associated to each point 
on shape boundaries. They are sometimes quite easy to derive, and generalize 
the differential concept of “invariant curvature” in meaningful ways. A study of 
the interplay between local and global invariances of viewing transformations 
is also very useful for shape smoothing, generating invariant scale-space shape 
representations, and leads to various invariant shape enhancement operations. 

Many students, collaborators and academic colleagues and friends have 
helped me develop the point of view exposed in this paper. I am grateful to 
all of them for the many hours of discussions and debates on these topics, for 
agreeing and disagreeing with me, for sometimes fighting and competing, and 
often joining me on my personal journey into the field of applied invariance 
theory. The papers listed below are our contributions to this area and
further extremely relevant contributions to invariant shape signatures and shape 
processing by Weiss, Cyganski, VanGool, Brill, Morel, Faugeras, Olver, Adler, 
Cipolla and their colleagues can easily be located in the literature. 
Abstract. Structural pattern recognition is characterized by the
representation of patterns in terms of symbolic data structures, such as
strings, trees, and graphs. In this paper we review recent developments
in this field. The focus of the paper will be on new methods that allow
one to transfer some well established procedures from statistical pattern
recognition to the symbolic domain. Examples from visual form anal- 
ysis will be given to demonstrate the feasibility of the proposed methods. 

Keywords: structural pattern recognition; string, tree and graph match- 
ing; edit distance; median, generalized median, and weighted mean com- 
putation; clustering; self-organizing map. 



1 Introduction 

Pattern recognition is based on the concept of similarity. If the objects under 
consideration are represented by means of feature vectors from an n-dimensional 
feature space, then similarity can be measured by means of distance functions, 
such as Euclidean or Mahalanobis distance. These categories of distance
measures belong to the statistical approach to pattern recognition. In the present
paper we focus on structural pattern recognition. This approach is
characterized by the use of symbolic data structures, such as strings, trees, or graphs for
pattern representation. Symbolic data structures are more powerful than feature 
vectors, because they allow a variable number of features to be used. Moreover, 
not only unary properties of the patterns under study can be represented, but 
also contextual relationships between different patterns and subpatterns. 

In order to compute the similarity of symbolic data structures, suitable
measures, for either similarity or distance, must be provided. One widely used class
of distance functions in the domain of strings is string edit distance. This
distance function can be extended to the domain of graphs. Other
distance functions on trees and graphs have also been proposed.

Although symbolic data structures are more powerful than feature vectors 
for pattern representation, one of the shortcomings of the structural approach is 
the lack of a rich set of basic mathematical tools. As a matter of fact, the vast 
majority of all recognition procedures used in the structural domain are based
on nearest neighbor classification using one of the distance measures mentioned
above. By contrast, a large set of methods has become available in statistical
pattern recognition, including various types of neural networks, decision theoretic
methods, machine learning procedures, and clustering algorithms.

In this paper we put particular emphasis on novel work in the area of struc- 
tural pattern recognition that aims at bridging the gap between statistical and 
structural pattern recognition in the sense that it may yield a basis for adapting 
various techniques from the statistical to the structural domain. Especially the 
topic of clustering symbolic structures will be addressed. 

In the next section, basic concepts from the symbolic domain will be
introduced. Then in Section 3 median and generalized median of a set of symbolic
structures will be presented and computational procedures discussed. The
following section treats the weighted mean of symbolic data structures. It will
then be shown how these concepts can be used for the
purpose of graph clustering. Application examples with an emphasis on visual
form analysis will be given next. Finally a summary and conclusions will
be presented in the last section.

2 Basic Concepts in Structural Matching 

Due to space limitations, we will explicitly mention in this section only concepts
and algorithms that are based on graph representations. The corresponding con- 
cepts and algorithms for strings and trees can be derived as special cases. 

In a graph used in pattern recognition, the nodes typically represent objects 
or parts of objects, while the edges describe relations between objects or object 
parts. Formally, a graph is a 4-tuple, $g = (V, E, \mu, \nu)$ where $V$ is the set of nodes,
$E \subseteq V \times V$ is the set of edges, $\mu : V \to L_V$ is a function assigning labels to
the nodes, and $\nu : E \to L_E$ is a function assigning labels to the edges. In this
definition, $L_V$ and $L_E$ are the sets of node and edge labels, respectively.

If we delete some nodes from a graph g, together with their incident edges, 
we obtain a subgraph g' ⊆ g. A graph isomorphism from a graph g to a graph g'
is a bijective mapping from the nodes of g to the nodes of g' that preserves all 
labels and the structure of the edges. Similarly, a subgraph isomorphism from g' 
to g is an isomorphism from g' to a subgraph of g. Another important concept 
is maximum common subgraph. A maximum common subgraph of two graphs, 
g and g', is a graph g" that is a subgraph of both g and g' and has, among all 
possible subgraphs of g and g', the maximum number of nodes. Notice that the
maximum common subgraph of two graphs is usually not unique. 
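
As an illustration of these definitions, the following sketch stores a labeled graph as two dictionaries (node labels and edge labels) and tests for a label-preserving isomorphism by trying every bijection between the node sets. The representation and the names `make_graph` and `is_isomorphic` are assumptions made for this example; the exhaustive search is factorial in the number of nodes and is meant only for very small graphs.

```python
from itertools import permutations

def make_graph(node_labels, edge_labels):
    """Labeled directed graph: {node: label} and {(u, v): label}."""
    return {"nodes": dict(node_labels), "edges": dict(edge_labels)}

def is_isomorphic(g, h):
    """True if some bijection between the nodes of g and h preserves
    node labels, edges, and edge labels (brute force)."""
    g_nodes, h_nodes = list(g["nodes"]), list(h["nodes"])
    if len(g_nodes) != len(h_nodes) or len(g["edges"]) != len(h["edges"]):
        return False
    for perm in permutations(h_nodes):
        f = dict(zip(g_nodes, perm))
        # Node labels must agree under the candidate mapping.
        if any(g["nodes"][v] != h["nodes"][f[v]] for v in g_nodes):
            continue
        # Every edge of g must map onto an edge of h with the same label;
        # equal edge counts then make the edge correspondence one-to-one.
        if all(
            (f[u], f[v]) in h["edges"] and h["edges"][(f[u], f[v])] == label
            for (u, v), label in g["edges"].items()
        ):
            return True
    return False

g = make_graph({1: "a", 2: "b", 3: "a"}, {(1, 2): "x", (2, 3): "y"})
h = make_graph({"p": "a", "q": "a", "r": "b"},
               {("q", "r"): "x", ("r", "p"): "y"})
h_other = make_graph({"p": "a", "q": "a", "r": "b"},
                     {("q", "r"): "y", ("r", "p"): "y"})

same = is_isomorphic(g, h)             # mapping 1->q, 2->r, 3->p works
different = is_isomorphic(g, h_other)  # no edge labeled "x" in h_other
```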

Graph matching is a generic term that denotes the computation of any of the 
concepts introduced above, as well as graph edit distance (see further below). 
Graph isomorphism is a useful concept to find out if two objects are the same, 
up to invariance properties inherent to the underlying graph representation. 
Similarly, subgraph isomorphism can be used to find out if one object is part 
of another object, or if one object is present in a group of objects. Maximum 
common subgraph can be used to measure the similarity of objects even if there
exists no graph or subgraph isomorphism between the corresponding graphs. 
Clearly, the larger the maximum common subgraph of two graphs is, the greater 
is their similarity. 

Real world objects are usually affected by noise such that the graph repre- 
sentation of identical objects may not exactly match. Therefore it is necessary to 
integrate some degree of error tolerance into the graph matching process. A
powerful alternative to maximum common subgraph computation is error-tolerant
graph matching using graph edit distance. In its most general form, a graph edit 
operation is either a deletion, insertion, or substitution (i.e. label change). Edit 
operations can be applied to nodes as well as to edges. They are used to model 
the errors and distortions that may change an ideal graph into a distorted ver- 
sion. In order to enhance the modeling capabilities, a cost is usually assigned to 
each edit operation. The costs are real non-negative numbers. The higher the 
cost of an edit operation is, the less likely is the corresponding error to occur. 
The costs are application dependent and must be defined by the system designer
based on knowledge from the underlying domain. 

The concepts of error-tolerant graph matching and graph edit distance will be
described only informally; formal treatments can be found in the literature. An
error-tolerant graph matching can be understood as a sequence of edit operations that
transform graph $g_1$ into $g_2$ such that the accumulated cost of all edit operations
needed for this transformation is minimized. The cost associated with such a
sequence of edit operations is called the graph edit distance of $g_1$ and $g_2$, and is
written as

$d(g_1, g_2) = \min\{c(S) \mid S \text{ is a sequence of edit operations that transform } g_1 \text{ into } g_2\}$   (1)

where $c(S)$ is the accumulated cost of all edit operations in sequence $S$. Clearly,
if $g_1 = g_2$ then no edit operation is needed and $d(g_1, g_2) = 0$. On the other hand,
the more $g_1$ and $g_2$ differ from each other, the more edit operations are needed,
and the larger is $d(g_1, g_2)$.

The concepts that correspond to graph isomorphism, subgraph isomorphism, 
maximum common subgraph and graph edit distance in the domain of strings 
are identity of strings, subsequence, longest common subsequence and string 
edit distance, respectively. Notice, however, that the corresponding algorithms 
in the string domain are of lower complexity. For example, string edit distance 
is quadratic in the length of the two strings under comparison 0, while graph 
edit distance is exponential jjj. 
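
The quadratic behaviour of string edit distance can be seen in the standard
dynamic programming scheme. The following is a minimal sketch, assuming constant
costs for insertion, deletion and substitution; the function name and the cost
parameters are illustrative and not taken from the paper.

```python
def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """String edit distance by dynamic programming.

    The table has (len(a) + 1) * (len(b) + 1) cells and each cell is filled
    in constant time, hence the quadratic running time.
    """
    n, m = len(a), len(b)
    # d[i][j] is the cheapest way to turn a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + dele,                       # delete a[i-1]
                d[i][j - 1] + ins,                        # insert b[j-1]
                d[i - 1][j - 1] + (0.0 if same else sub)  # keep or substitute
            )
    return d[n][m]
```

With unit costs this reduces to the familiar Levenshtein distance; application
dependent costs can be passed through the three parameters.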

3 Median and Generalized Median of Symbolic Data 
Structures 

Clustering is a key concept in pattern recognition. While a large number of
clustering algorithms have become available in the domain of statistical pattern
recognition, relatively little attention has been paid to the clustering of
symbolic structures, such as strings, trees, or graphs. In principle, however,
given a suitable similarity (or dissimilarity) measure, for example, edit
distance, many of the clustering algorithms originally developed in the context
of statistical pattern recognition can be applied in the symbolic domain.

In this section we review work on a particular problem in clustering, namely, 
the representation of a set of similar objects through just a single prototype. This 
problem typically occurs after a set of objects has been partitioned into clusters. 
Rather than storing all members of a cluster, only one, or a few, representative 
elements are being retained. 

Assume that we are given a set $P = \{p_1, \dots, p_n\}$ of patterns and some
distance function $d(p_1, p_2)$ to measure the dissimilarity between patterns
$p_1$ and $p_2$. A straightforward approach to capturing the essential
information in set $P$ is to find a pattern $\bar{p}$ that minimizes the average
distance to all patterns in $P$, i.e.,

$$\bar{p} = \arg\min_{p} \frac{1}{n} \sum_{i=1}^{n} d(p, p_i). \qquad (2)$$

Let’s call pattern $\bar{p}$ the generalized median of $P$. If we constrain the
representative to be a member of the given set $P$, then the resultant pattern

$$\hat{p} = \arg\min_{p \in P} \frac{1}{n} \sum_{i=1}^{n} d(p, p_i) \qquad (3)$$

is called the median of $P$.

In the context of this paper we consider the case where the patterns are
represented by means of symbolic data structures, particularly strings or graphs.
The task considered in the following is the computation of the median and
generalized median of a set of strings or graphs.

First, we notice that the computation of the median of symbolic structures
is a straightforward task. It requires just $O(n^2)$ distance computations.
(Notice, however, that each of these distance computations has a high
computational complexity, in general.) But the median is restricted in the sense
that it can’t really generalize from the given patterns represented by set $P$.
Therefore, the generalized median is the more powerful and interesting concept.
However, the actual computational procedure for finding a generalized median of
a given set of symbolic structures is no longer obvious.

The concept of median and generalized median of a set of strings was introduced
in the pattern recognition literature in ng. An optimal procedure for computing
the median of a set of strings was proposed in m-. This procedure is an
extension of the algorithm for string edit distance computation. However, this
extension suffers from a high computational complexity, which is exponential in
the number of strings in set $P$. Hence its applicability is restricted to
rather small sets and short strings. A suboptimal version of the same algorithm
and its application to the postprocessing of OCR results is described in HZl.
For another version of the method that uses positional information to prune the
high-dimensional search space and its application to handwriting recognition
see m. Other suboptimal procedures for the computation of generalized median of
a set of strings are described in |1HI20I21I22|. Genetic algorithms have been
studied as well.

The high computational complexity of generalized median computation becomes
even more severe if graphs rather than strings are used for pattern
representation. Nevertheless, optimal algorithms and suboptimal methods based on
genetic search are investigated in j24t/’,5]. An application to graphical symbol
recognition is discussed in m-.
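
The median of a set (the member with the smallest average distance to all
members) can be found by exhaustive comparison. The sketch below assumes strings
as patterns and unit-cost edit distance as the distance function; the helper
names are illustrative. It performs on the order of n^2 distance computations,
as stated above, and does not attempt the harder generalized median.

```python
def edit_distance(a, b):
    """Unit-cost string edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def set_median(patterns, dist=edit_distance):
    """Return (median, average distance) over the members of `patterns`."""
    best, best_total = None, float("inf")
    for candidate in patterns:
        total = sum(dist(candidate, p) for p in patterns)
        if total < best_total:  # ties are resolved in favour of the earlier member
            best, best_total = candidate, total
    return best, best_total / len(patterns)
```

Because the candidate is restricted to the given set, the result can never be a
pattern that was not observed, which is exactly the limitation that motivates
the generalized median.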

4 Weighted Mean of Symbolic Data Structures 

Consider two patterns, $p_1$ and $p_2$. We call pattern $p$ a weighted mean of
$p_1$ and $p_2$ if, for some real number $\alpha$ with
$0 < \alpha < d(p_1, p_2)$, the following two conditions hold:

$$d(p_1, p) = \alpha, \qquad (4)$$

$$d(p_1, p_2) = \alpha + d(p, p_2). \qquad (5)$$

Here $d(\cdot,\cdot)$ again denotes some distance function. Clearly, if $p_1$ and
$p_2$ are represented in terms of feature vectors then weighted mean computation
can be easily solved by means of vector addition. In g2,‘II27j a procedure for
the computation of weighted mean in the domain of strings has been proposed.
This procedure is based on the ’classical’ algorithm for string edit distance
computation. For a pair of strings, $p_1$ and $p_2$, it first computes the edit
distance. This yields an edit matrix with an optimal path. If one takes a
subset, $S$, of the edit operations on the optimal path and applies them on
$p_1$, a new string, $p$, is obtained. It can be proven that string $p$ is a
weighted mean of $p_1$ and $p_2$, obeying equations (4) and (5) with $\alpha$
being equal to the accumulated cost of all edit operations in subset $S$.
Furthermore, it can be proven that the procedure given in m is complete, i.e.,
there is no weighted mean of two strings, $p_1$ and $p_2$, that can’t be
generated by means of the given procedure.
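
The procedure can be illustrated with a small sketch. It assumes unit edit costs
(so that the weight equals the number of applied operations) and always applies
the first k cost-bearing operations of one optimal path; the cited procedure
allows arbitrary subsets and arbitrary costs. All function names are
illustrative.

```python
def edit_table(a, b):
    """Unit-cost edit matrix: d[i][j] is the distance between a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d


def optimal_path(a, b):
    """Backtrace one optimal path as (kind, source_symbol, target_symbol) triples."""
    d = edit_table(a, b)
    i, j, ops = len(a), len(b), []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            kind = "keep" if a[i - 1] == b[j - 1] else "sub"
            ops.append((kind, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", a[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", b[j - 1]))
            j -= 1
    return ops[::-1]


def weighted_mean(a, b, k):
    """Apply the first k cost-bearing operations of the optimal path to a."""
    out, applied = [], 0
    for kind, src, dst in optimal_path(a, b):
        if kind == "keep":
            out.append(src)
        elif applied < k:
            applied += 1
            out.append(dst)  # applied: "sub"/"ins" emit the target, "del" emits ""
        else:
            out.append(src)  # not applied: keep the source symbol ("" for "ins")
    return "".join(out)
```

For any k between 0 and the edit distance, the resulting string lies at distance
k from the first string and at the remaining distance from the second, which is
what conditions (4) and (5) require.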

Weighted mean is a useful tool for tasks where a pattern, pi, has to be 
changed so as to make it more similar to another pattern, p2. Intuitively speaking, 
the weighted mean, p, of a pair of patterns, pi and p2, is a structure that is 
located between pi and p2 in the symbolic pattern space. 

In |2S| weighted mean has been extended from the domain of strings to the 
domain of graphs. A concrete application of weighted mean in the graph domain 
will be described in the next section. 

5 Clustering of Symbolic Data Structures 

Self organizing map (som) is a very well established method in the area of
statistical pattern recognition and neural networks Eg. A pseudo code
description of the classical som-algorithm is given in Fig. 1. The algorithm can
serve two purposes, either clustering or mapping a high-dimensional pattern
space to a lower-dimensional one. In the present paper we focus on its
application to clustering. Given a set of patterns, $X$, the algorithm returns a
prototype $y_i$ for each cluster $i$. The prototypes are sometimes called
neurons. The number of clusters, $M$, is a parameter that must be provided a
priori. In the algorithm, first each prototype $y_i$ is randomly initialized
(line 4). In the main loop (lines 5-11) one randomly selects an element
$x \in X$ and determines the neuron $y^*$ that is nearest to $x$. In the inner
loop (lines 8, 9) one considers all neurons $y$ that are within a neighborhood
$N(y^*)$ of $y^*$, including $y^*$, and updates them according to the formula in
line 9. The effect of neuron updating is to move neuron $y$ closer to pattern
$x$. The degree by which $y$ is moved towards $x$ is controlled by the parameter
$\alpha$, which is called the learning rate. It has to be noted that $\alpha$ is
dependent on the distance between $y$ and $y^*$, i.e. the smaller this distance
is, the larger is the change on neuron $y$. After each iteration through the
repeat-loop, the learning rate $\alpha$ is reduced by a small amount, thus
facilitating convergence of the algorithm. It can be expected that after a
sufficient number of iterations the $y_i$'s have moved into areas where many
$x_j$'s are concentrated. Hence each $y_i$ can be regarded as a cluster center.
The cluster around center $y_i$ consists of exactly those patterns that have
$y_i$ as closest neuron.

som-algorithm

(1) input: a set of patterns, X = {x_1, ..., x_N}
(2) output: a set of prototypes, Y = {y_1, ..., y_M}
(3) begin
(4)   initialize Y = {y_1, ..., y_M} randomly
(5)   repeat
(6)     select x ∈ X randomly
(7)     find y* such that d(x, y*) = min{d(x, y) | y ∈ Y}
(8)     for all y ∈ N(y*) do
(9)       y = y + α(x - y)
(10)    reduce learning rate α
(11)  until termination condition is true
(12) end

Fig. 1. The som-algorithm
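
For feature vectors the loop of Fig. 1 can be written down in a few lines. The
following is a minimal sketch under several assumptions that are not part of the
paper: the neurons are arranged in a one-dimensional chain, the neighborhood
contains the neurons within `radius` positions of the winner, the learning rate
of a neighbor is divided by one plus its chain distance to the winner, the
prototypes are initialized with randomly drawn patterns, and the loop stops
after a fixed number of iterations.

```python
import math
import random


def som(patterns, m, iterations=200, alpha=0.5, decay=0.99, radius=1, seed=0):
    """Vector-space som following Fig. 1; returns the list of m prototypes."""
    rng = random.Random(seed)
    # Line 4: initialise the prototypes (here with randomly drawn patterns).
    protos = [list(p) for p in rng.sample(patterns, m)]
    for _ in range(iterations):                       # lines 5 and 11
        x = rng.choice(patterns)                      # line 6
        # Line 7: the winner y* is the prototype nearest to x.
        win = min(range(m), key=lambda k: math.dist(x, protos[k]))
        # Line 8: all neurons within `radius` chain positions of the winner.
        for k in range(max(0, win - radius), min(m, win + radius + 1)):
            rate = alpha / (1 + abs(k - win))         # smaller change far from y*
            # Line 9: move y towards x.
            protos[k] = [y + rate * (xi - y) for y, xi in zip(protos[k], x)]
        alpha *= decay                                # line 10
    return protos
```

With `radius=0` only the winner is updated, so the procedure behaves like an
online variant of k-means. Replacing `math.dist` by an edit distance and the
update in line 9 by a weighted mean is what carries the scheme over to symbolic
data.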

In the original version of the som-algorithm all $x_j$ and $y_i$ are feature
vectors. In this section, its adaptation to the graph domain is discussed, see
I3UI31I. (The algorithm can also be adapted to the string domain.) To make the
algorithm applicable in the graph domain, two new concepts are needed. First,
a graph distance measure has to be provided in order to find graph $y^*$ that is
closest to $x$ (see line 7). Secondly, a graph updating procedure implementing
line 9 has to be found. If we use graph edit distance, and the weighted mean
graph computation procedure (see Section 4) as updating method, the algorithm in
Fig. 1 can in fact be applied in the graph domain. For further implementational
details see |.31)l,31j.

Using the concepts of edit distance and generalized median, the well-known
k-means clustering algorithm can also be transferred from the domain of feature
vectors to the symbolic domain. In this case no updating procedure as given in
Section 4 is needed. Instead, cluster centers are computed using the generalized
median as described in Section 3. In fact, k-means clustering can be understood
as a batch version of som, where neuron updating is done only after a complete
cycle through all input patterns, rather than after presentation of each
individual pattern PH.

Both som and k-means clustering require the number of clusters to be known
beforehand. In PHI the application of validation indices was studied in order to
find the most appropriate number of clusters automatically.

Fig. 3. Generalized median of the digits in Fig. 2
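
The batch scheme can be sketched for strings as follows. Note the substitution:
to keep the example short, the cluster center is the set median (the best member
of the cluster) rather than the generalized median used in the text, and unit
edit costs are assumed. The function names and the way initial centers are
passed in are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost string edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def set_median(cluster):
    """Member of the cluster with the smallest summed distance to all members."""
    return min(cluster, key=lambda c: sum(edit_distance(c, s) for s in cluster))


def symbolic_k_means(strings, centers, max_iter=20):
    """Alternate assignment and center update until the centers stop changing."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each string goes to its nearest center.
        clusters = [[] for _ in centers]
        for s in strings:
            k = min(range(len(centers)), key=lambda i: edit_distance(s, centers[i]))
            clusters[k].append(s)
        # Update step: recompute every center once per cycle (batch update).
        new = [set_median(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters
```

A procedure for the generalized median could be plugged in for `set_median`
without changing the rest of the loop.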

6 Application Examples 

In this section we’ll first show an example of generalized median computation
for the domain of strings. In the example online handwritten digits from a
subset m of UNIPEN database PZ] are used. Each digit is originally given as a
sequence $s = (x_1, y_1), \dots, (x_n, y_n)$ of points in the $x$-$y$-plane. In
order to transform such a sequence of points into a string, we first resample
the given data points such that the distance between any consecutive pair of
points has a constant value, $\Delta$. That is, $s$ is transformed into sequence
$s' = (\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_m, \bar{y}_m)$ where
$|(\bar{x}_{i+1}, \bar{y}_{i+1}) - (\bar{x}_i, \bar{y}_i)| = \Delta$ for
$i = 1, \dots, m - 1$. Then from sequence $s'$ a string $z_1 \dots z_{m-1}$ is
generated where $z_i$ is the vector pointing from $(\bar{x}_i, \bar{y}_i)$ to
$(\bar{x}_{i+1}, \bar{y}_{i+1})$.

The costs of the edit operations are defined as follows:
$c(z \to \varepsilon) = c(\varepsilon \to z) = |z| = \Delta$,
$c(z \to z') = |z - z'|$. Notice that the minimum cost of a substitution is
equal to zero (if and only if $z = z'$), while the maximum cost is $2\Delta$.
The latter case occurs if $z$ and $z'$ are parallel and have opposite direction.

Ten different instances of digit 3 are shown in Fig. 2. Their generalized median
is presented in Fig. 3. Intuitively speaking, this instance of digit 3
represents the characteristic features of the samples in Fig. 2 very well. It
suitably captures the variation in shape among the different patterns in the
given set.

Fig. 4. A sequence of 11 instances of digit 2: x, z_1, z_2, ..., z_9, y. The first and last digits
are from the database m-. The other digits are weighted means computed for various
values of α
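
The string representation and its edit costs translate directly into code. The
sketch below assumes that the points have already been resampled to a constant
spacing and represents each symbol as a 2D displacement vector; the function
names are illustrative.

```python
import math


def to_vector_string(points):
    """Turn m resampled points into the string z_1 ... z_{m-1} of displacement vectors."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


def cost_indel(z):
    """Cost of deleting or inserting z: its length |z|."""
    return math.hypot(z[0], z[1])


def cost_sub(z, w):
    """Cost of substituting z by w: the length of the difference vector |z - w|."""
    return math.hypot(z[0] - w[0], z[1] - w[1])


def vector_edit_distance(u, v):
    """Edit distance between two vector strings under the costs above."""
    d = [[0.0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for i in range(1, len(u) + 1):
        d[i][0] = d[i - 1][0] + cost_indel(u[i - 1])
    for j in range(1, len(v) + 1):
        d[0][j] = d[0][j - 1] + cost_indel(v[j - 1])
    for i in range(1, len(u) + 1):
        for j in range(1, len(v) + 1):
            d[i][j] = min(d[i - 1][j] + cost_indel(u[i - 1]),
                          d[i][j - 1] + cost_indel(v[j - 1]),
                          d[i - 1][j - 1] + cost_sub(u[i - 1], v[j - 1]))
    return d[-1][-1]
```

Two unit vectors pointing in opposite directions give the largest substitution
cost, twice the step length, in line with the remark above.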
Fig. 5. 15 characters each representing a different class 



Next we show an example of weighted mean computation m-. The same string
representation, edit costs and database as for generalized median computation
were used in this example. Fig. 4 shows a sequence of 11 instances of digit 2.
The first and last instance of this sequence, $x$ and $y$, are taken from the
database m-. All other digits, $z_1, z_2, \dots, z_9$, are generated using the
procedure described in Section 4. String $z_i$ corresponds to the weighted mean
of $x$ and $y$ for $\alpha = \frac{i}{10} d(x, y)$; $i = 1, \dots, 9$. It can be
clearly observed that with an increasing value of $\alpha$ the characters
represented by string $z_i$ become more and more similar to $y$.

Finally, an example of graph clustering using the som algorithm introduced 
in Section 5 will be given EH-. In this example, graph representations of capital 
characters were used. In Fig. 5, 15 characters are shown, each representing a 
different class. The characters are composed of straight line segments. In the 
corresponding graphs, each line segment is represented by a node with the coor- 
dinates of the endpoints in the image plane as attributes. No edges are included 
in this kind of graph representation. The edit costs are defined as follows. The 
cost of deleting or inserting a line segment is proportional to its length, while 
substitution costs correspond to the difference in length of the two considered 
line segments. 
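
The edit costs for this edge-free graph representation are simple enough to
state as code. The sketch assumes a node is stored as a pair of endpoints and
uses a proportionality constant `beta` for deletions and insertions; the value
of that constant is not given in the text, so the default of 1.0 is only a
placeholder.

```python
import math


def length(segment):
    """Euclidean length of a line segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = segment
    return math.hypot(x2 - x1, y2 - y1)


def delete_cost(segment, beta=1.0):
    """Deleting a segment costs an amount proportional to its length."""
    return beta * length(segment)


# Insertion is treated symmetrically to deletion.
insert_cost = delete_cost


def substitute_cost(seg_a, seg_b):
    """Substituting one segment by another costs their difference in length."""
    return abs(length(seg_a) - length(seg_b))
```

Under these costs two segments of equal length can be exchanged for free, so the
measure reflects segment lengths rather than positions.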

For each of the 15 prototypical characters shown in Fig. 5, ten distorted
versions were generated. Examples of distorted A’s and E’s are shown in Fig. 6
and Fig. 7, respectively. The degree of distortion of the other characters is
similar to Fig. 6 and Fig. 7. As a result of the distortion procedure, a sample
set of 150 characters was obtained. Although the identity of each sample was
known, this information was not used in the experiment described below.

The clustering algorithm described in Section 5 was run on the set of 150
graphs representing the (unlabeled) sample set of characters, with the number
of clusters set to 15. As the algorithm is non-deterministic, a total of 10 runs
were executed. The cluster centers obtained in one of these runs are shown in
Fig. 8. Obviously, all cluster centers are correct in the sense that they
represent meaningful prototypes of the different character classes. In all other
runs similar results were obtained, i.e., in none of the runs was an incorrect
prototype generated. Also, all of the 150 given input patterns were assigned to
their correct cluster center.

Fig. 6. Ten distorted versions of character A

Fig. 7. Ten distorted versions of character E

From these experiments it can be concluded that the new graph clustering 
algorithm is able to produce a meaningful partition of a given set of graphs into 
clusters and find an appropriate prototype of each cluster. 



7 Conclusions 

In this paper, some recent developments in structural pattern recognition are
reviewed. In particular the median and generalized median of a set as well as
the weighted mean of a pair of symbolic structures are discussed. These concepts
are interesting in their own right. But their combination, together with the
edit distance of symbolic structures, allows clustering procedures, such as
k-means or som, to be extended from vectorial pattern representations into the
symbolic domain. A number of examples from the area of shape analysis are given
to demonstrate the applicability of the proposed methods.

From the general point of view, the procedures discussed in this paper can
be regarded as a contribution towards bringing the disciplines of statistical
and structural pattern recognition closer together. In the field of statistical
pattern recognition, a rich set of methods and procedures has become available
during the past decades. On the other hand, the representational power of
feature vectors is limited when compared to symbolic data structures used in the
structural domain. But structural pattern recognition suffers from the fact that
only a limited repository of mathematical tools has become available. In fact,
most recognition procedures in the structural approach follow the
nearest-neighbor paradigm, where an unknown pattern is matched against the full
set of prototypes. The present paper shows how clustering procedures that were
originally developed in statistical pattern recognition can be adapted to the
symbolic domain. It can be expected that a similar adaptation will be possible
for other types of pattern recognition algorithms.

Fig. 8. Cluster centers obtained in one of the experimental runs
Abstract. In this paper we show how the shape and dynamics of 
complex actions can be encoded using the intrinsic curvature and torsion 
signatures of their component actions. We then show how such invariant 
signatures can be integrated into a Dynamical Bayesian Network which 
compiles efficient recurrent rules for predicting and recognizing complex 
actions. An application in skill analysis is used to illustrate our approach. 

Keywords: Differential Geometry, Invariance, Dynamical Bayesian Net- 
works, hidden Markov models, learning complex actions. 



1 Introduction 



There is an ever increasing number of tasks where it would be useful to be able
to have machines encode, predict and recognize complex spatio-temporal patterns
defined by the dynamics and trajectories of interacting components in 3D over
time. These include gesture recognition, robot skill acquisition, prosthetics
and skill training for human expertise. Current work in this area is
characterized by two quite different approaches. For those consistent with a
long tradition of Kinematics, Robotics and Biomedical Engineering, the problem
is typically posed in terms of deterministic control models involving solutions
to forward (dynamics to trajectories) or inverse (trajectories to dynamics)
kinematics |Q. The other, exclusively behavioural, approach (which predominates
in the gesture recognition literature) uses Machine Learning and Pattern
Recognition approaches to the recognition of complex actions [hl3lllll|. The
benefit of this latter perspective is that it is inherently concerned with
recognition within the context of variability. The benefit of the former is
that it allows for more detailed modeling of underlying processes. In this paper
we endeavor, more or less, to integrate both perspectives into a single,
invariant, stochastic model - in terms of Differential Geometry and Dynamical
Bayesian Networks (DBN).

New types of active sensors, like Magnetic Field Sensors (MFS), have now
evolved to provide fully 3D encoding of position and pose changes of moving
objects (see, for example, the MIT GANDALF program^1). They provide us with
more reliable ways of extracting shape trajectories (given an adequate
calibration procedure) and they have already been used in the area of virtual
reality HGI, tele-robotics, film industry and, more recently, for more detailed
studies of human kinematics 0. MFS have also been used to recognize Combat
Signals^2 using 3D signature correlation measures. However, there are still a
number of open questions of interest to this paper.

^1 http://gn.www.media.mit.edu/groups/gn/projects

1. Encoding the shape and dynamics of complex actions. How to uniquely
encode the trajectory shape and dynamics of complex actions invariant to the
absolute position and orientation of the action?

2. Descriptions of complex actions. How to decompose complex actions 
into basic components which can apply in a consistent and complete fashion 
while also providing a rich symbolic description? 

3. Learning, recognizing and predicting complex actions. How to com- 
pile rules which define the ways in which many different types of complex 
actions are performed, recognized and predicted? 

Of particular interest is to study the last question, thus posing our research in 
terms of how to recognize and/or predict complex 3D action trajectories by sim- 
ply recording their total instantaneous velocity and acceleration dynamics. This 
is equivalent to learning forward kinematics, in contrast to the learning of dynam- 
ics from spatial trajectories - inverse kinematics. In this work we have explored 
these issues using MFS and, in particular, the Polhemus System^3. However, the 
following discussion, treatment and algorithms equally apply to 3D feature data 
collected via passive vision sensors. 



2 Trajectory Shapes and Dynamics 

In our case we have, for each sensor and action, $i$, a recorded 3D trajectory
defined by:

$$C_i(t) = (x_i(t), y_i(t), z_i(t)) \qquad (1)$$

where $x$, $y$, $z$ correspond to the Cartesian coordinates of the sensor
position, relative to the transmitter origin, over time, $t$. It is well known
that in order to compute derivatives of such data it is necessary to regularize
them using multiscaled operators. Past approaches have focused on Gaussian
pyramid (scale-space) methods for the encoding of contour features [Z|.

Although such an approach produces useful computations of derivatives, it is 
not adaptive to the signal’s inherent local variations. More importantly, it does 
not, per se, offer a best-fitting approximation to the space curve, at a given scale. 
This is particularly relevant when there is a need to encode the trajectory and 
its dynamics at a given scale using physically referenced quantities like velocity 
and acceleration. For these reasons we have used a multi-scaled least-squares 
filter specifically designed to track the curve at a given scale: the Savitzky-Golay 
(SG) filter[3. The filter is derived as follows. 

^2 http://www.hitl.washington.edu/scivw/JOVE/Articles/dsgbjsbb.txt
^3 http://www.polhemus.com/
We first denote the digital convolution of a signal $f(t)$ with a filter $c$, as:

$$g_t = \sum_{n=-n_L}^{n_R} c_n f_{t+n} \qquad (2)$$

where $n_L$ and $n_R$ correspond to the number of points to the “left” and
“right” of the position $t$. A moving window averaging would correspond to
$c_n = 1/(n_L + n_R + 1)$, a constant over all positions. However this
“0-moment” filter, by definition, does not preserve higher order moments of the
function. The SG filter is designed to preserve such higher order moments by
approximating the underlying function within a moving window by a polynomial.
Since the process of least squares fitting involves linear matrix inversion, the
coefficients of a fitted polynomial are linear combinations of the data values.
This implies that the coefficients can be computed only once and applied by
convolution procedures as defined above.

The coefficients, $a$, are derived by (least squares) fitting a polynomial of
degree $M$ in $t$, namely, $a_0 + a_1 t + a_2 t^2 + \dots + a_M t^M$, to the
values $f_{-n_L}, \dots, f_{n_R}$. The design coefficient matrix, $A$, for this
problem is defined by:

$$A_{ij} = i^j; \quad i = -n_L, \dots, n_R; \quad j = 0, \dots, M \qquad (3)$$

and the solution for the vector of $a_j$'s is:

$$a = (A^T A)^{-1} A^T f \qquad (4)$$

where $A^T$ and $A^{-1}$ correspond to the transpose and inverse of the matrix,
$A$, respectively.

Due to the linearity of this solution we can project the component onto unit
orthogonal vectors $e_n = (0, 0, \dots, 1, 0, \dots, 0)$ with unity only in the
$n$-th position, which corresponds to the $n$-th window position in $c_n$
(Eqn. (2)). Applying Eqn. (4) to each basis vector results in the generation of
what we term “SG filter” kernels which can be applied as standard convolution
operators on the data. That is,

$$c_n = \left\{ (A^T A)^{-1} (A^T e_n) \right\}_0 \qquad (5)$$

The result is a set of polynomial SG least squares smoothing filters which can
be applied to each of the $x(t)$, $y(t)$, $z(t)$ recordings of the form:

$$X(t : n, m) = SG(n, m) * x(t) \qquad (6)$$

$$Y(t : n, m) = SG(n, m) * y(t) \qquad (7)$$

$$Z(t : n, m) = SG(n, m) * z(t) \qquad (8)$$

where $*$ denotes convolution.

The most important benefit of such polynomial approximations is that
higher-order derivatives can be determined algebraically from the derived
polynomial coefficients. For the case of quartic polynomials, the first, second
and third derivative forms can be directly computed as each defines a polynomial
design matrix whose terms are shifted one to the left of the former - due to the
properties of polynomial differentiation. Such filters are well-known to
preserve, to some extent, discontinuities and variations in the degrees of
smoothness of the data.
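
The kernel construction can be sketched without any numerical library. The code
below solves the normal equations for a single row of $(A^T A)^{-1}$ and
evaluates it at every window offset, which yields the smoothing kernel for
`deriv=0` and a derivative kernel otherwise (scaled by the factorial of the
derivative order, for unit sample spacing). The function names, and the choice
to solve the normal equations by plain Gaussian elimination, are assumptions made
for this illustration.

```python
from math import factorial


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sg_kernel(n_left, n_right, degree, deriv=0):
    """Savitzky-Golay coefficients c_n for n = -n_left, ..., n_right."""
    offsets = range(-n_left, n_right + 1)
    # Normal-equation matrix (A^T A)_{jk} = sum over window offsets i of i^(j+k).
    ata = [[sum(float(i) ** (j + k) for i in offsets) for k in range(degree + 1)]
           for j in range(degree + 1)]
    unit = [1.0 if j == deriv else 0.0 for j in range(degree + 1)]
    row = solve(ata, unit)  # row `deriv` of the symmetric matrix (A^T A)^{-1}
    return [factorial(deriv) * sum(row[j] * float(i) ** j for j in range(degree + 1))
            for i in offsets]
```

For a five-point window and a quadratic fit this reproduces the classical
smoothing weights (-3, 12, 17, 12, -3)/35. The kernel is applied as in Eqn. (2),
that is, as a weighted sum over the samples around each position.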

The magnitudes of velocity, $V(t)$, acceleration, $A(t)$, and displacement,
$D(t)$, for 3D motion can then be computed using these coefficients for each
position parameter, $(X(t), Y(t), Z(t))$, as:

$$V(t : n, m) = \sqrt{(X_t(t : n, m))^2 + (Y_t(t : n, m))^2 + (Z_t(t : n, m))^2} \qquad (9)$$

$$A(t : n, m) = \sqrt{(X_{tt}(t : n, m))^2 + (Y_{tt}(t : n, m))^2 + (Z_{tt}(t : n, m))^2}. \qquad (10)$$

In the following section we will see how they can also be used to compute the 
curvature and torsion values of a curve. 

In all, then, the SG filter satisfies a number of constraints required for the
computation of dynamics and spatio-temporal trajectories by providing
least-squares filters for smoothing and differentiation in terms of polynomial
filter kernels at any number of scales defined by the window size and order of
polynomial.

2.1 Invariant Signatures: κτ - va Spaces 

There are many cases where it is necessary to encode action trajectories in ways
which are invariant to their absolute position and pose. Consider, for example,
situations where finger movements are used to communicate invariant to the pose
and position of the hand(s): situations where only the relative motion of the
movement trajectories is important. Current measures which use joint angles,
relative displacements and feature ratios do not necessarily guarantee
uniqueness, invariance and an implicit ability to reconstruct a given action.
For such reasons we have explored intrinsic shape descriptors from Differential
Geometry: curvature ($\kappa$) and torsion ($\tau$). These features have already
been used by others for invariant shape descriptions 0. The curvature measures
the arc-rate of change of the tangent vector. From the unit tangent vector

$$T(t) = C_t(t) / \|C_t(t)\| \qquad (11)$$

at a position on the trajectory, $t$, we can compute the curvature vector,
$T_t(t)$, whose magnitude is the curvature

$$\kappa(t) = \|T_t(t)\|. \qquad (12)$$

Torsion, $\tau(t)$, measures the degree to which the curve departs from a planar
path as represented by the plane defined by the tangent and normal vectors (the
osculating plane^). The normal to a curve is the unit curvature vector:

$$N(t) = T_t(t) / \|T_t(t)\| \qquad (13)$$

and the binormal vector, the vector orthogonal to the tangent and normal, is
defined by:

$$B(t) = C_t(t) \times N(t). \qquad (14)$$

Torsion can then be computed as the scalar product (projection) of the arc-rate
of change of the binormal and the normal vectors (the second curvature):

$$\tau(t) = -B_t(t) \cdot N(t). \qquad (15)$$

These measures can be directly computed from the derivatives using the SG
filters discussed above 0.

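The magnitude of curvature, at scale $(n, m)$, is more readily computed as
(dropping the scale parameters, $n, m$):

$$\|\kappa(t)\| = \frac{\sqrt{(Y_t(t) Z_{tt}(t) - Z_t(t) Y_{tt}(t))^2 + (X_t(t) Z_{tt}(t) - Z_t(t) X_{tt}(t))^2 + (X_t(t) Y_{tt}(t) - Y_t(t) X_{tt}(t))^2}}{\|C_t(t)\|^3} \qquad (16)$$

Torsion, $\tau$, at scale $(n, m)$, is defined by (again, dropping the scale
parameters, $n, m$):

$$\tau(t) = \frac{\begin{vmatrix} X_t(t) & Y_t(t) & Z_t(t) \\ X_{tt}(t) & Y_{tt}(t) & Z_{tt}(t) \\ X_{ttt}(t) & Y_{ttt}(t) & Z_{ttt}(t) \end{vmatrix}}{E^2(t) + F^2(t) + G^2(t)} \qquad (17)$$

where

$$E(t) = \begin{vmatrix} Y_t(t) & Z_t(t) \\ Y_{tt}(t) & Z_{tt}(t) \end{vmatrix} \qquad (18)$$

$$F(t) = \begin{vmatrix} Z_t(t) & X_t(t) \\ Z_{tt}(t) & X_{tt}(t) \end{vmatrix} \qquad (19)$$

$$G(t) = \begin{vmatrix} X_t(t) & Y_t(t) \\ X_{tt}(t) & Y_{tt}(t) \end{vmatrix} \qquad (20)$$

where $|Z|$ denotes the determinant of matrix $Z$.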
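
Given the first three derivatives of the trajectory at one instant, curvature
and torsion follow from a cross product and a triple product. The sketch below
uses the standard closed forms from differential geometry; the function names
and the convention of returning zeros for a locally straight path are
assumptions of this example. In practice the derivative triples would come from
the SG derivative filters; here they are written out analytically for a circular
helix, whose curvature and torsion are known in closed form.

```python
import math


def cross(u, v):
    """Cross product of two 3D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Scalar product of two 3D vectors."""
    return sum(a * b for a, b in zip(u, v))


def curvature_torsion(d1, d2, d3):
    """Curvature and torsion from first, second and third derivative vectors."""
    c = cross(d1, d2)  # components are the 2x2 determinants of the derivative rows
    c2 = dot(c, c)     # sum of the squared determinants
    if c2 == 0.0:
        return 0.0, 0.0  # locally straight: no curvature, torsion set to zero
    kappa = math.sqrt(c2) / dot(d1, d1) ** 1.5
    tau = dot(c, d3) / c2  # 3x3 determinant of (d1, d2, d3) over the same sum
    return kappa, tau


def helix_derivatives(a, b, t):
    """Analytic derivatives of the helix (a cos t, a sin t, b t)."""
    d1 = (-a * math.sin(t), a * math.cos(t), b)
    d2 = (-a * math.cos(t), -a * math.sin(t), 0.0)
    d3 = (a * math.sin(t), -a * math.cos(t), 0.0)
    return d1, d2, d3
```

For a helix with radius parameter 2 and pitch parameter 1 the function returns
0.4 and 0.2 at every t, independent of where along the curve it is evaluated.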

Once computed, these values determine the Serret-Frenet equations of a curve,
which define how the trihedral vectors $(T(t), N(t), B(t))$ at position $t$
change (are transformed into their values at $t + 1$) by:

$$T_t(t) = \kappa(t) N(t) \qquad (21)$$

$$N_t(t) = -\kappa(t) T(t) + \tau(t) B(t) \qquad (22)$$

$$B_t(t) = -\tau(t) N(t). \qquad (23)$$

The Serret-Frenet equations provide proof that the shape of a curve can be
uniquely determined up to its absolute position and orientation in 3D. That is,
$\kappa$ and $\tau$ define the transformation matrix which carries the moving
trihedron from one point on a curve to the next.

In all, then, we first define the SG-derived multi-scaled $\kappa\tau(t)$ curves
as the locus of points in $\kappa\tau(t)$ space defining the values of
$\kappa(t)$ and $\tau(t)$ as a function of time, $t$, as shown in Figure 1. The
dynamics are then defined by a similar plot of acceleration, $a(t)$, and
velocity, $v(t)$, over the same temporal index - as also shown in Figure 1:
curves in $va(t)$ space. The benefit of this representation is that we can
encapsulate the shape and dynamics of fully 3D actions via 2 simple contours in
two two-dimensional spaces - invariant to their absolute position and pose in
3D. The additional benefits of such plots are that they visually demonstrate
correlations and other properties between the components of shape and dynamics
as well as identify “critical” points such as those where $va(t)$ or
$\kappa\tau(t)$ curves change their directions.

Fig. 1. Complex action statics and dynamics. Top: sample 3D component action
trajectory recorded from one sensor. The three axes are calibrated in mm.
κτ (lower left) and va (lower right) - fully 3D velocity and acceleration
signatures over time (scale: n=40, m=4). The grey level on each signature
indicates the temporal evolution of the action.

Complex actions can then be uniquely encoded up to their absolute
(spatio-temporal) position and orientation by:

1. each component’s ($i$) invariant $\kappa\tau(t)$, $va(t)$ signatures;

2. the relative spatial ($\delta x_{ij}$) and temporal ($\delta t_{ij}$)
positions for the initial state ($t = 0$) of each pair $(i, j)$ of action
components;

3. the relative direction of the initial tangent vectors ($\delta T_{ij}$).

Accordingly, a complex action composed of $n > 2$ components can be uniquely
defined, up to their absolute position and pose, by $n$ $\kappa\tau$, $va$
signatures and $8n(n-1)/2 = 4n(n-1)$ additional initial relational
spatio-temporal position (six) and direction (two, a unit vector) values.

Figure 2 shows the degree to which the computations of $\kappa\tau$ and $va$
curves are invariant to the position and pose of the actions. In this case we
have used the SG filter with a window size defined by $n_L = n_R = 20$ and a
fourth-order polynomial fit, with sampling at 120 Hz.













Fig. 2. Top row: same action in different positions and orientations. Middle and
bottom rows: va and κτ curves for each action. Note the invariance of their
signatures (see text for details).



Screw Decomposition. Although the above derivations provide invariant
descriptions of the two components of complex actions (trajectory shape and
dynamics), they do not provide, as such, a symbolic representation or taxonomy
of the action components which makes practical sense to those who need to
analyze and describe actions. To these ends we have developed what we term a
“screw decomposition model” (SDM) to generate such descriptions. This approach
is related to screw kinematics theory as developed over the past two centuries
to encode mechanical motions in general jTT]. This model differs from past work
in so far as we use Screw Theory to encode motions. It follows simply from the
observation that the trajectory of a “screw action” is helical and that a helix
has constant curvature and torsion everywhere. Specifically, for a helix defined
by:

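$$C(t) = (a \cos t, a \sin t, b t) \qquad (24)$$

it can be shown that

$$\kappa = \frac{a}{a^2 + b^2}, \qquad \tau = \frac{b}{a^2 + b^2} \qquad (25)$$

and

$$a = \frac{\kappa}{\kappa^2 + \tau^2}, \qquad b = \frac{\tau}{\kappa^2 + \tau^2}. \qquad (26)$$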

These relations show how changes in curvature and torsion values define the
types of instantaneous (or prolonged if constant for a period of time) screw
actions (for example, “left-handed” and “right-handed”) which can approximate
a given action component. Equally, temporally contiguous points which are close
together on a $\kappa\tau$ curve can be approximated by a single screw action or
helix. In particular, we define a “fixed point” in $\kappa\tau$ space as one
that does not change for a given time interval over the evolution of the action
trajectory and that represents the trajectory of nearby points in $\kappa\tau$
space. That is, contiguity in local shape and time lies at the basis of our
approach to encoding such actions. Consequently, our aim was to:



— describe action components in terms of a sequence of screws (hashed curva- 
ture and torsion values) and so changes in screw “states” ; 

— determine the relationships between different types of screws and how they 
are performed - the screw dynamics (in va space); 

— explore how to compile rules (Machine Learning) which define the execution 
of complex motions via the above invariant decomposition model over all 
component actions. 
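The first aim, describing an action component as a sequence of screws and the changes between them, can be sketched as a run-length encoding of a quantized kt state sequence. This is only an illustration under the assumption that each time sample has already been assigned an integer screw state; the name `screw_segments` is hypothetical.

```python
from itertools import groupby


def screw_segments(states):
    """Collapse a per-sample state sequence into (state, start_index, length) runs."""
    segments = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((state, start, length))
        start += length
    return segments


# Three samples in state 3, two in state 7, then one more in state 3.
segments = screw_segments([3, 3, 3, 7, 7, 3])
```

Each run corresponds to a period during which a single helix approximates the trajectory, and the boundaries between runs are the changes in screw “state”.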



3 Learning, Recognizing, and Predicting Actions 

Since this work is not concerned with innovation models but, rather, encoding, 
recognition and prediction processes, we adopt a Machine Learning perspective 
for learning the relationships between the dynamics and trajectory of individ- 
ual action components, and their interdependencies. We explore how Dynamical 
Bayesian Networks (DBN), consisting of sets of coupled hidden Markov mod- 
els (HMMs) and initial position and pose information for each sensor, can be 
used to encode, predict and recognize complex actions. Each component HMM 
is used to model the relationship between the sensor’s invariant trajectory
signature (curvature and torsion) and its dynamics (velocity and acceleration). More
formally, in this case we define our complex action DBN as



$$\lambda = \{\pi, A, B\} \qquad (27)$$

where

$$\pi = \{\pi^i_u = p(S^i_u)\}, \quad i = 1, \ldots, N; \; u = 1, \ldots, N_i \qquad (28)$$

where N corresponds to the number of HMMs (nodes) in the network and $N_i$ to the
number of states in the i-th HMM. The generalized state transition matrix,
A, is defined by

$$A = \{a^{ij}_{uv} = p(S^j_v(t+1) \mid S^i_u(t))\} \qquad (29)$$

where i, j correspond to a pair of HMMs and u, v to their states. When i = j
the state transitions are within a given HMM, while when i ≠ j the state
transitions apply between a pair of HMM states. In both cases we have used a single
unit lag model (t to t + 1). However, in general, different types of lags can be
used. It should be noted that A can also encode restricted interactive models by
not allowing state transitions between specific HMMs, and even specific states
within and between HMMs, as well as unidirectional or bidirectional
dependencies between t and t + 1. In other words, A is the “causal model matrix” for the
DBN.

The final component of the DBN is the matrix B defined by 



$$B = \{b^i_u(o^i_k) = p(o^i_k \mid S^i_u)\}. \qquad (30)$$

This corresponds to, for a given HMM i, the probability of a given observation
k, that is $o^i_k$, given the state u, $S^i_u$, of HMM i. Finally, we define
$b^i_u(o^i_t) = p(o^i_t \mid S^i_u)$ as the probability of a given observation
for HMM i at time t. This term is necessary in developing the proposed
Generalized Baum Welch (GBW) model (see below).

Discrete HMM and DBN models assume a finite number of states and
observations and, as we will see, there are complications in using such discrete models,
as their defining characteristics affect performance in terms of encoding,
generalization, prediction and discrimination. Consequently, in this work (in contrast
to most reported studies) we have explored how the numbers of states and
observations affect encoding and discrimination performance. Our preference is to
develop continuous action mixture models, a topic under investigation in a
related project. Here, however, we use each discrete HMM to encode the dependency of
trajectory shape, or the type of screw action (kt), on (observed) dynamics (va).
In this case we have also combined velocity and acceleration into a single discrete
dynamical observable variable.

Another difference between this DBN model and past approaches is that, in
this case, the states are not really “hidden”. Here the HMMs are being used to 
encode the relationships between two classes of observable variables, and we have 
training data where states and observations can be extracted in parallel during 
the training phase. It is therefore possible for us to generate initial estimates 
for each HMM using a standard moving window method. Each process in this 
initial estimation scheme is defined as follows. 

1. KT quantization into screw (action) states: $S^i_1, \ldots, S^i_{N_i}$, using
percentile-based attribute quantization over k and t values.

2. va quantization into dynamic “symbols”: $o(va_1) = o_1, \ldots, o(va_k) = o_k,
\ldots, o(va_{M_i}) = o_{M_i}$, again using percentile-based attribute quantization
of v and a values.

3. Using the moving window method we obtain, from the training data, initial 

estimates of each type of action component HMM, from w = = 



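$$p(S^i_u) = \sum_{t}^{T} \delta(S^i_u(t)) / T \qquad (31)$$

$$p(S^i_u(t) \mid S^i_v(t-1)) = \sum_{\tau=1}^{T} \delta(S^i_u(t+\tau) \mid S^i_v(t+\tau-1)) / T \qquad (32)$$

$$p(o^i_k \mid S^i_u) = \sum_{t}^{T} \delta(o^i_k(t) \mid S^i_u(t)) / T \qquad (33)$$

where T corresponds to the length of the observation sequence, and

$$\delta(x) = \begin{cases} 1 & \text{iff the condition } x \text{ holds} \\ 0 & \text{otherwise.} \end{cases} \qquad (34)$$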

These initial estimates are then used as input to the Baum-Welch procedure
which, via Expectation-Maximization, adjusts the complete DBN model to fit 
the training data and the inherent non-stationary performance characteristics of 
such models. 
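The initial estimation scheme above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it quantizes a single attribute per variable (the paper combines k with t, and v with a), it counts over the whole sequence rather than a moving window, and it normalizes each row of the transition and observation tables so that the row sums to one, which is an assumption about how the counts are turned into conditional probabilities. All function names are hypothetical.

```python
from collections import Counter


def percentile_edges(values, n_bins):
    """Interior bin edges that split `values` into n_bins equal-frequency bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def quantize(value, edges):
    """Bin index of `value`: the number of edges it reaches or exceeds."""
    return sum(value >= edge for edge in edges)


def row_normalize(counts, n_rows, n_cols):
    """Turn a Counter keyed by (row, col) into a row-stochastic matrix."""
    matrix = []
    for r in range(n_rows):
        total = sum(counts[(r, c)] for c in range(n_cols))
        # Rows never seen in training fall back to a uniform distribution (assumption).
        matrix.append([counts[(r, c)] / total if total else 1.0 / n_cols
                       for c in range(n_cols)])
    return matrix


def initial_estimates(states, symbols, n_states, n_symbols):
    """Counting estimates of (pi, A, B) from aligned state and symbol sequences."""
    T = len(states)
    pi = [states.count(u) / T for u in range(n_states)]
    A = row_normalize(Counter(zip(states, states[1:])), n_states, n_states)
    B = row_normalize(Counter(zip(states, symbols)), n_states, n_symbols)
    return pi, A, B


# Made-up curvature and speed samples, two bins each.
curvature = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.12, 0.95]
speed = [1.0, 3.0, 1.2, 2.8, 0.9, 3.1, 1.1, 2.9]
states = [quantize(k, percentile_edges(curvature, 2)) for k in curvature]
symbols = [quantize(v, percentile_edges(speed, 2)) for v in speed]
pi, A, B = initial_estimates(states, symbols, n_states=2, n_symbols=2)
```

Because states and observations are both observable in the training data, these tables can be filled by direct counting before any Baum-Welch iteration is run.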



3.1 Estimation Update and Prediction 

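Estimation: The Generalized Baum Welch Algorithm. As a generalization
of the single HMM case [10] we define the generalized forward operator as:

$$\alpha_1(S^i_u) = p(S^i_u)\, p(o^i_1 \mid S^i_u) = \pi^i_u\, b^i_u(o^i_1) \qquad (35)$$

$$\alpha_{t+1}(S^j_v) = \sum_{i=j} \sum_{u} \alpha_t(S^i_u)\, a^{ij}_{uv}\, b(o^j_{t+1} \mid S^j_v) + \sum_{i \neq j} \sum_{u} \alpha_t(S^i_u)\, a^{ij}_{uv}\, b(o^j_{t+1} \mid S^j_v) \qquad (36)$$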

The first component of the right-hand side of Equation (36) corresponds to the
intra-HMM forward operator and the second to the inter-HMM components.
Together they can be represented in a single matrix form as:

$$\alpha_{t+1} = \alpha_t A B_t \qquad (37)$$

which encodes all possible inter- and intra-state transitions within the given model
as a function of the causal model. In this case the diagonal block submatrices of A
correspond to the intra-HMM state transitions, while the outer rectangular blocks
correspond to state transition weights between each pair of HMMs. Accordingly we can
model the DBN by blocking specific types of inter- and intra-state transitions. It
is for this reason that we term A the “causal model” matrix.

In a similar way to the forward operator, the generalized backward operator
can be defined, in matrix form, as:

$$\beta_t = \beta_{t+1} A' B_t \qquad (38)$$

where A' corresponds to the transpose of A and, $\forall i, j$, $\beta_T(S^i_j) = 1$.
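The block structure of A and the forward update of Equation (37) can be illustrated with a small sketch. It stacks the states of all HMMs into one vector, zeroes the blocks that a chosen causal model excludes, and applies one forward step. The numbers are made up, the function names are hypothetical, and whether the rows of the masked matrix should be renormalized is not stated in the text, so no renormalization is done here.

```python
def block_mask(sizes, allowed):
    """0/1 mask over the stacked states; allowed[i][j] is True if HMM i may drive HMM j."""
    owner = [i for i, n in enumerate(sizes) for _ in range(n)]
    return [[1 if allowed[owner[r]][owner[c]] else 0 for c in range(len(owner))]
            for r in range(len(owner))]


def apply_causal_model(A, mask):
    """Zero every transition weight that the causal model excludes."""
    return [[a * m for a, m in zip(row_a, row_m)] for row_a, row_m in zip(A, mask)]


def forward_step(alpha, A, b_next):
    """alpha_next[v] = (sum_u alpha[u] * A[u][v]) * b_next[v], with alpha a row vector."""
    n = len(alpha)
    return [sum(alpha[u] * A[u][v] for u in range(n)) * b_next[v] for v in range(n)]


# Two HMMs with two states each, stacked into a 4-state vector (illustrative values).
A = [[0.7, 0.3, 0.1, 0.2],
     [0.4, 0.6, 0.3, 0.1],
     [0.2, 0.1, 0.8, 0.2],
     [0.1, 0.3, 0.5, 0.5]]
alpha = [0.6, 0.4, 0.5, 0.5]
b_next = [0.9, 0.2, 0.3, 0.7]  # observation likelihoods for the next time step

# Causal model with no coupling: only the diagonal (intra-HMM) blocks survive.
uncoupled = apply_causal_model(A, block_mask([2, 2], [[True, False], [False, True]]))
alpha_next = forward_step(alpha, uncoupled, b_next)
```

With the off-diagonal blocks masked out, the update reduces to two independent single-HMM forward steps; allowing a block such as `allowed[0][1]` lets the first HMM's states contribute to the second HMM's forward values.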

As with the single HMM Baum Welch (EM) procedure we first estimate 
the expected state transitions between any two states within a given model. 
As defined above this incorporates influences of other HMM states within and 
between the HMMs according to the DBN causal model. The net result, however, 
has the same format as the single HMM case, where, at a given time point, we 
have: 





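$$\ldots = \alpha_t A \beta_{t+1} \qquad (39)$$

Integrating over t we obtain:

$$\ldots \qquad (40)$$

$$\ldots \qquad (41)$$

and

$$\ldots \qquad (42)$$

and so

$$\bar{B} = \{\bar{b}^i_u(o^i_k)\} = \sum_{t}^{T} p(S^i_u(t) \mid o^i_k(t)) \,/\, \ldots \qquad (43)$$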



The above formulation allows us to consider a number of algorithms as a
function of the degree to which the intra- and inter-state dependencies are
estimated in parallel or sequentially. The fully parallel Generalized Baum
Welch algorithm is as follows:

Parallel GBW Estimation Method

1. Select a causal model for A by excluding possible dependencies.

2. Generate initial estimates of $\lambda$ from the training data using the moving
window method (see above).

3. Re-estimate the DBN as $\bar{\lambda} = \{\bar{\pi}, \bar{A}, \bar{B}\}$.

4. If $\bar{\lambda} \approx \lambda$, STOP.

5. Set $\lambda = \bar{\lambda}$ and go to step 3.

However, we have used an equally viable, and computationally less demanding,
sequential (“residual”) implementation of the Generalized Baum Welch
operators, while continuing to investigate efficient ways of implementing the fully
parallel version.

Residual GBW Estimation Method

1. Select a causal model for A by excluding possible dependencies.

2. For each sensor, i:

- Generate initial estimates for each sensor HMM $\lambda^i$ from the training data
using the moving window method.

- Re-estimate the sensor’s HMM parameters until convergence using the
standard Baum Welch algorithm.

3. Then compute the inter-HMM state dependencies in accord with the causal
model (off-diagonal blocks of A and an associated predictor method).

Prediction: The Generalized Viterbi Algorithm. Analogous to the single
HMM Viterbi algorithm, the generalized method predicts the optimal set of
state sequences which fits the observations. It is based upon computing the
most likely state sequence for each of the component HMMs. This is well known
to be computationally expensive, and so suboptimal solutions are typically
used, in particular ones where only the previous best set of states is retained in
determining the appropriate state at a given time. That is:

Algorithm

Initialization

For $1 \le i \le N$, $1 \le u \le N_i$:

$$\delta_1(S^i_u) = \pi^i_u\, b^i_u(o^i_1)$$

$$\psi_1(S^i_u) = 0$$

Recursion

For $2 \le t \le T$, $1 \le j \le N$, $1 \le v \le N_j$:

$$\ldots = \arg\max_{i,j,u,v} [\delta_{t-1}(S^i_u)\, a^{ij}_{uv}]$$

Using the more efficient residual DBN estimation model, we simply used a
corresponding sequential state estimation approach. That is, the Viterbi algorithm
was applied to each individual sensor and then a second Viterbi procedure was
used to estimate the degree to which one most likely sensor state sequence could
predict another from the off-diagonal blocks of the causal model matrix, A (the
i ≠ j components of Eqn (36)). This was implemented by simply defining the
predictor state sequence as the “inter-HMM” state sequence and the dependent state
sequence as the equivalent “observation sequence” (see below).
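The first stage of this sequential approach applies the standard single-HMM Viterbi algorithm to each sensor. A minimal log-domain sketch is given below; it is the textbook algorithm rather than the generalized version, and the model values in the example are invented for illustration.

```python
import math


def viterbi(pi, A, B, observations):
    """Most likely state path of one discrete HMM and its log probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(pi)
    delta = [log(pi[u]) + log(B[u][observations[0]]) for u in range(n)]
    backpointers = []
    for o in observations[1:]:
        prev, delta, pointers = delta, [], []
        for v in range(n):
            # Best predecessor of state v at this time step.
            best = max(range(n), key=lambda u: prev[u] + log(A[u][v]))
            pointers.append(best)
            delta.append(prev[best] + log(A[best][v]) + log(B[v][o]))
        backpointers.append(pointers)

    state = max(range(n), key=lambda u: delta[u])
    log_prob = delta[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, log_prob


# Two "sticky" states, each favouring its own observation symbol.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.2, 0.8]]
path, log_prob = viterbi(pi, A, B, [0, 0, 1, 1, 1])
```

For the second stage, one plausible reading of the text is that the same routine is reused with the arm sensor's Viterbi path playing the role of the state sequence and the hand sensor's path playing the role of the observations; that pairing is an interpretation, not something the code above enforces.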

4 Assessing DBN Performance 

The Generalized Viterbi algorithm defines the most likely state sequence for each
node of the DBN in terms of the final (joint) posterior maximum likelihood (log)
probability of the state sequences given the observation sequences. This
measure is reasonable for pattern recognition but not necessarily the
most representative way of measuring how well the model encodes, predicts or
can track the training data. For these reasons we have used the Viterbi-derived
posterior maximum likelihood (log) probability for classifying actions and a
different measure to represent how well the action is encoded via a given DBN. This
latter measure is based upon computing the Hamming distance between observed
and predicted observation sequences using a Monte Carlo method. That is, for
each sensor and for each task/participant, we generate predicted observation
sequences by randomly selecting observations according to the estimated DBN
model probabilities and the Viterbi-estimated state sequences. This is repeated
a number of times to give a mean and standard deviation of the similarity
between observed and predicted observation sequences using the reversed
normalized Hamming distance defined by:



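$$r(\hat{o}, o) = \sum_{t}^{T} \delta_{o(t)}(\hat{o}(t)) / T \qquad (44)$$

where

$$\delta_{o(t)}(\hat{o}(t)) = \begin{cases} 1 & \text{iff } \hat{o}(t) = o(t) \\ 0 & \text{otherwise.} \end{cases} \qquad (45)$$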



This constitutes a direct measure of the likelihood that the particular set of
observation sequences (one per sensor or node of the DBN) matches those
predicted from the complete model using the Viterbi solutions for each of the node’s
predicted state sequences. Comparing these values indicates the uniqueness of
the Viterbi solutions over the complete DBN. This measure is useful as it
provides a direct estimate (μ, σ) of how well the model can encode the training data.
However, it does not provide a way of determining the discriminatory power of
the DBN model for a given action in predicting the data from the model. For this
reason we have also introduced an additional component to the measure: how
well a given DBN model can predict observations from data known not to arise
from the model (¬o), that is, a different task or participant, etc. This
comparison results in the prediction-discrimination function (PDF), which clearly varies
as a function of the number of states ($n_s$) and observations ($n_o$):

$$PDF(n_s, n_o) = r_{n_s,n_o}(\hat{o}, o) - r_{n_s,n_o}(\hat{o}, \neg o) \qquad (46)$$

This is analogous to the “detectability” score in signal detection theory, p(“Hit”)
- p(“False Alarm”), for the case of the ideal detector jSj. Again, ¬o corresponds
to observations not from the training set. This measure of
“encoding-relative-to-discrimination” penalizes the Hamming Distance measure as a function of the
degree to which the HMM can equally track observation sequences not generated
from the target.
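The encoding measure and the prediction-discrimination function can be sketched as below. The sketch assumes a Viterbi state path and an observation matrix B are already available for one sensor; it samples predicted observations from the rows of B, scores them with the reversed normalized Hamming distance, and subtracts the score obtained on data not generated by the target. Function names, the number of runs and the fixed seed are illustrative choices, not values from the paper.

```python
import random
from statistics import mean, stdev


def reversed_hamming(predicted, observed):
    """Fraction of time steps at which predicted and observed symbols agree."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def sample_observations(state_path, B, rng):
    """Draw one symbol per time step from the observation row of the Viterbi state."""
    symbols = list(range(len(B[0])))
    return [rng.choices(symbols, weights=B[s])[0] for s in state_path]


def monte_carlo_similarity(state_path, B, observed, n_runs=200, seed=0):
    """Mean and standard deviation of the similarity over repeated random draws."""
    rng = random.Random(seed)
    scores = [reversed_hamming(sample_observations(state_path, B, rng), observed)
              for _ in range(n_runs)]
    return mean(scores), stdev(scores)


def prediction_discrimination(r_target, r_other):
    """Similarity on the target's own data minus similarity on foreign data."""
    return r_target - r_other


# Illustrative model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1.
B = [[0.9, 0.1], [0.2, 0.8]]
mu, sigma = monte_carlo_similarity([0, 0, 1, 1], B, observed=[0, 0, 1, 1])
```

A score near 1 on the target data combined with a much lower score on foreign data gives a large value of the function, while a model that tracks every sequence equally well scores near zero.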
5 Experiments and Results

Although the above model defines an invariant stochastic model for encoding,
learning and predicting complex actions, its discrete formulation presents a
number of empirical issues. To study its performance we have considered four
“assembly” and “disassembly” tasks as illustrated in Figure 3 below. In all, three
of the authors performed each task 10 times using the 4 sensors attached to
the forearms and hands as shown in Figure 3. Each task was performed using
the same initial resting position of the arms and hands and at a fixed sitting
position at a table upon which the parts were laid out in identical positions.
Each participant was allowed some rehearsal trials before commencing the
formal recordings. All data were collected for each participant over one recording
session. This resulted in 4 (tasks) x 4 (sensors) x 10 (trials) x 3 (participants) =
480 data streams sampled at 120 Hz for approximately 20 seconds each, that is,
approximately 2,000 x 480 data points or 2 Megabytes of data. These data were smoothed
and sub-sampled to 12 Hz for a number of reasons. First, the sheer amount of
data needed to be reduced for real-time computational purposes. Second, since our
model involves differential invariants, the state transition matrices were likely
to be less redundant at the lower sampling rate.









Fig. 3. Four stages of each of the two construction tasks. The remaining
two tasks consisted of disassembling each object. All tasks were initiated from the same
hand and arm resting positions, shown in column one.



Half of the trials were used for training the model, half for testing on unseen 
examples. 

In order to ascertain the optimal numbers of states and symbols over all tasks,
trials and participants, we generated a large number of DBN solutions varying
in the numbers of clusters in va and kt attributes. In both cases the clustering
“resolution” was determined in terms of a generalized percentile splitting method
on each attribute, resulting in a binning or quantization of the respective spaces
based upon equal frequencies. This type of binning is efficient though not
necessarily optimal. However, 16 states of kt and 9 discrete observation values of va
were found to optimize the prediction-discrimination function (PDF).

With these numbers of observations and states, we then determined, for a
given task and participant, the degree to which each sensor’s observations could
be predicted from its own training data using only that sensor’s individual
HMM model. Over all four tasks, participants and sensors, the results showed that
we could correctly and consistently predict observed sensor dynamics on average
85% (0.85) of the time, with a standard deviation of ±10% (0.1), using the Viterbi
generated optimal state sequences.

We then measured the degree to which each arm sensor’s states could predict
each hand sensor’s states by the following procedure. The 16 x 16 sensor state
dependencies do not provide a direct measure of the degree to which one sensor
state sequence could predict those of another, particularly when there is no
“ground truth”. Consequently we adopted an alternate strategy to examine such
correlations. Since we were concerned with analyzing the possible dependencies of
the hand sensor states on the arm sensor states, we defined the Viterbi-predicted
hand state sequence as dependent state “observations” relative to the arm sensor
state sequence. We then estimated an equivalent inter-sensor HMM based on the
off-diagonal estimates of the causal model matrix A. From its Viterbi solution
we could predict, using the Monte Carlo method and the same reversed Hamming
distances, the degree to which each arm state sequence could predict each hand
state sequence at four different time lags of t = 0, -1, -2, -3. From this analysis
we found that the best prediction occurred with zero lag, with a 55% prediction rate
between the sensors. This is highly significant, since random performance
would be at the 6% (1/16) level.



6 Discussion 

In this project we have investigated a number of issues related to the encod- 
ing, prediction and recognition of complex human actions. Of particular interest 
has been how to formulate the recognition of complex actions, their forward 
and inverse kinematics in terms of sets of hidden Markov models or Dynamical 
Bayesian Networks. These initial investigations show that this type of model has 
significant potential in so far as it incorporates variability in action performance, 
estimation, prediction and classification all within the same framework. Open 
issues still include those pertaining to optimal estimation of the complete DBN 
model and how to implement the generalized Baum Welch and Viterbi-type 
methods. Past work on estimation jS| illustrates the complexity of this problem 
even with discrete models and in this work the residual analysis approach to the 
problem provided useful insights into the types of dependencies existing between 
limb segments. 

What is also needed, for this approach to be more robust, is to replace the
current discrete state and observation variable models with mixture or related
models. The clustering methods currently used for generating discrete state and
observation values have demonstrable positive and negative effects on HMM
performance as a function of the number and types of states and observations.

The use of Differential Geometry for uniquely encoding action trajectories
in invariant ways and in a regularized fashion has proved useful and also has
potential for integrating more formally with Screw Theory in Kinematics [12]. The
SG filters have proven useful as they not only provide efficient ways of
computing derivatives but also provide filters which can reproduce the physical values
required to maintain validity of velocities, acceleration, etc.

In all, then, this paper proposes an approach to the analysis, learning and
prediction of the shape of complex human actions. We have explored a unique
but invariant coding scheme and a method for encapsulating the variations and
interactions which occur in complex actions via Dynamical Bayesian Networks.
Together they illustrate how Stochastic Differential Geometry can be a powerful
tool for the future analysis of complex kinematic systems.



Abstract. Visual forms come in countless varieties, from the simplicity
of a sphere, to the geometric complexity of a face, to the fractal
complexity of a rugged coast. These varieties have been studied with
mathematical tools such as topology, differential geometry and fractal
geometry. They have also been examined, largely in the last three
decades, in terms of mereology, the study of part-whole relationships. The
result is a fascinating body of theoretical and empirical results. In this
paper I review these results, and describe a new development that
applies them to the problem of learning names for visual forms and their
parts.



1 Introduction 

From the anatomy and physiology of the retina, we know that the processes 
of vision begin from a source that is at once rich and impoverished: photon 
quantum catches at 5 million cones and 120 million rods in each eye [1]. This 
source is rich in the sheer number of receptors involved, the dynamic range of 
lighting over which they operate, and the volume of data they can collect over 
time. This source is impoverished in its language of description. The language 
can only state how many quanta are caught and by what receptors. It can say 
nothing about color, texture, shading, motion, depth or objects, all of which are 
essential to our survival. For this reason we devote precious biological resources — 
hundreds of millions of neurons in the retina and tens of billions of neurons in the 
cerebral cortex — to construct richer languages and more adaptive descriptions 
of the visual world. 

A key criterion for these more adaptive descriptions is that they allow us to 
predict, with economy of effort, future events that can affect our survival. We 
construct a world of objects and their actions, because carving the world this 
way lets us quickly learn important predictions. Running toward a rabbit leads 
to predictably different results than running toward a lion. These are important 
object-specific properties that cannot be learned in the language of quantum 
catches. 

We carve the world more finely still, dividing objects themselves into parts. 
Parts aid in the recognition of objects. Parts also allow more refined predictions: 
If, for instance, one is fighting a conspecific it might be critical to attend to 
certain parts, such as arms or legs or jaws, and relatively safe to ignore other 
parts such as ears. Moreover, some parts of shapes are better remembered than
others [2-4]. The centrality of parts to human vision can be seen in the following
six figures, each of which can be explained, as we will see shortly, by three rules
for computing the parts of objects.

In Figure 1 you probably see hill-shaped parts with dashed lines in the valleys 
between them. But if you turn the figure upside down, you will see a new set of 
hills, and now the dashed lines lie on top of the new hills [5]. 

In Figure 2, which of the two half moons on the right looks most similar to
the half moon on the left? In controlled experiments almost all subjects say that
the bottom one looks more similar to the half moon on the left, even though the
top half moon, not the bottom, has the same bounding curve as the half moon
on the left [6].

In Figure 3, most observers say the staircase on the right looks upside down,
whereas the one on the left can be seen either as right side up or as upside down
[7].



In Figure 4, the display on the left looks transparent, but the one on the 
right does not [8]. The luminances in the two cases are the same. 




Figure 3.        Figure 4.



In Figure 5, the symmetry of the shape on the left is easier to detect than 
the repetition of the shape on the right [9-12]. 

In Figure 6, a heart shape pops out among a set of popcorn-shaped distrac- 
tors, as shown on the left, but not vice versa, as shown on the right [13,14]. 




Figure 6.

Figure 6 suggests that we begin to construct parts, or at least boundaries between
parts, early in the stream of visual processing. Empirical evidence in support of
this suggestion comes from experiments using visual search [13,14] with stimuli
like Figure 6. In one experiment conducted by Hulleman, te Winckel and
Boselie [13], subjects searched for a heart-shaped target amidst popcorn-shaped
distractors, and vice versa. The heart had a concave (i.e., inward pointing) cusp
which divided it into two parts, a left and right half. The popcorn had a convex
(outward pointing) cusp and no obvious parts. Its convex cusp was chosen to
have the same angle as the concave cusp of the heart. The data indicate that
subjects search in parallel for the heart targets, but search serially for the
popcorn targets. It appears that parts are important enough to the visual system
that it devotes sufficient resources to search in parallel for part boundaries. This
early and parallel construction of part boundaries explains why parts affect the
perception of visual form in the wide variety of ways illustrated in Figures 1
through 6.

2 The Minima Rule 

Recent experiments suggest that human vision divides shapes into parts by the 
coordinated application of three geometric rules: the minima rule, the short-cut 
rule, and the part salience rule. The rules are as follows: 

• Minima Rule (for 3D Shapes): All concave creases and negative minima
of the principal curvatures (along their associated lines of curvature) form
boundaries between parts [5,7].

• Minima Rule (for 2D Silhouettes): For any silhouette, all concave cusps
and negative minima of curvature of the silhouette’s bounding curve are
boundaries between parts [5,7].

• Short-cut rule: Divide silhouettes into parts using the shortest possible cuts.
A cut is (1) a straight line which (2) crosses an axis of local symmetry, (3)
joins two points on the outline of a silhouette, such that (4) at least one of the
two points has negative curvature. Divide 3D shapes into parts using minimal
surfaces [15].

• Salience rule: The salience of a part increases as its protrusion, relative area,
and strength of part boundaries increase [7].
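The minima rule for 2D silhouettes lends itself to a small computational sketch. The code below is an illustration under stated assumptions, not a method from the paper: the silhouette is given as a closed list of boundary points traversed counterclockwise (so that concave stretches have negative signed curvature), curvature is estimated by central differences, and a part boundary is reported wherever the curvature is negative and locally minimal. The three-lobed test shape and all function names are made up for the example.

```python
import math


def signed_curvature(points):
    """Signed curvature at each vertex of a closed, counterclockwise curve."""
    n = len(points)
    kappa = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[(i + 1) % n]
        dx, dy = (x2 - x0) / 2.0, (y2 - y0) / 2.0            # first differences
        ddx, ddy = x2 - 2.0 * x1 + x0, y2 - 2.0 * y1 + y0    # second differences
        kappa.append((dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5)
    return kappa


def minima_rule_boundaries(points):
    """Indices where curvature is negative and a strict local minimum."""
    kappa = signed_curvature(points)
    n = len(kappa)
    return [i for i in range(n)
            if kappa[i] < 0
            and kappa[i] < kappa[i - 1]
            and kappa[i] < kappa[(i + 1) % n]]


# Three-lobed silhouette r(theta) = 1 + 0.3 cos(3 theta), one point per degree.
silhouette = []
for i in range(360):
    theta = 2.0 * math.pi * i / 360
    r = 1.0 + 0.3 * math.cos(3.0 * theta)
    silhouette.append((r * math.cos(theta), r * math.sin(theta)))

boundaries = minima_rule_boundaries(silhouette)
```

For this shape the reported indices fall in the three indentations between the lobes, at 60, 180 and 300 degrees. Pairing such boundary points into cuts would then be the job of the short-cut rule, which this sketch does not attempt.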

Together these rules can explain each visual effect illustrated in Figures 1-6. In
Figure 1, the straight lines in the valleys are negative minima of the principal
curvatures and therefore, according to the minima rule, they are part
boundaries. When you turn the illustration upside down this reverses figure and ground,
so that negative minima and positive maxima of curvature reverse places.
Therefore, according to the minima rule, you should see new part boundaries and
new parts [5]. A quick check of the illustration will confirm this prediction.

In Figure 2, the half moon on the top right has the same contour as the half
moon on the left. Yet most observers pick the half moon on the bottom right
as more similar to the half moon on the left, even though its contour is mirror
reversed and two of the minima parts have switched positions. This is explained
by the minima rule because this rule carves the bottom half moon into parts
at the same points as the half moon on the left, whereas it carves the top half
moon into different parts [6]. Apparently shape similarity is computed part by
part, not point by point.

In Figure 3 the staircase on the right looks inverted, because the parts defined
by the minima rule for the inverted interpretation have more salient part
boundaries (sharper cusps) than for the upright interpretation. Other things being
equal, human vision prefers that choice of figure and ground which leads to the
most salient minima-rule parts [7].

In Figure 4 we see transparency on the left but not on the right, even though 
all the luminances are identical. The reason is that on the right there are minima- 
rule part boundaries aligned with the luminance boundaries, so we interpret 
the different grays as different colors of different parts rather than as effects of 
transparency [8]. 

In Figure 5 we detect the symmetry on the left more easily than the repetition 
on the right, an effect noted long ago by Mach [12]. The minima rule explains 
this because in the symmetric shape the two sides have the same parts, whereas 
in the repetition shape the two sides have different parts [9]. Again it appears 
that shapes are compared part by part, not point by point. 

In Figure 6 the heart pops out on the left, but the popcorn does not pop
out on the right, even though the two have cusps with identical angles [13]. The
minima rule explains this because the concave cusp in the heart is a part
boundary whereas the convex cusp on the popcorn is not. Minima part boundaries
are computed early in the flow of visual processing since parts are critical to the
visual representation of shape.

The minima rule makes precise a proposal by Marr and Nishihara [16] that
human vision divides shapes into parts at “deep concavities”. The minima rule
has strong ecological grounding in the principle of transversality from the field
of differential topology. This principle guarantees that, except for a set of cases
whose total measure is zero, minima part boundaries are formed whenever two
separate shapes intersect to form a composite object, or whenever one shape
grows out of another [5,7].

3 Other Part Rules 

Given the central role of parts in object perception, it is no surprise that several
theories of these parts have been proposed. One class of theories claims that
human vision uses certain basic shapes as its definition of parts. Proponents of
basic-shape theories have studied many alternatives: polyhedra [17-19],
generalized cones and cylinders [16,20], geons [21,22], and superquadrics [23]. Of these,
the geon theory of Biederman is currently most influential. Geons are a special
class of generalized cylinders, and come in 24 varieties. The set of geons is
derived from four nonaccidental properties [21,24,25]. These properties are whether
(a) the cross section is straight or curved; (b) the cross section remains constant,
expands, or expands and contracts; (c) the cross section is symmetrical or
asymmetrical; and (d) the axis is straight or curved. These properties are intended
to make recognition of geons viewpoint invariant, although some experiments
suggest that geon recognition may nevertheless depend on viewpoint [26].
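The count of 24 varieties follows from the property values listed above: 2 x 3 x 2 x 2 = 24. The short enumeration below makes the arithmetic explicit; the dictionary keys are illustrative labels, not terminology from the geon literature.

```python
from itertools import product

# Values of the four nonaccidental properties as listed in the text.
GEON_PROPERTIES = {
    "cross_section_edges": ("straight", "curved"),
    "cross_section_size": ("constant", "expands", "expands and contracts"),
    "cross_section_symmetry": ("symmetrical", "asymmetrical"),
    "axis": ("straight", "curved"),
}

# Every combination of one value per property is one variety.
geons = list(product(*GEON_PROPERTIES.values()))
```

Each tuple in `geons` is one combination of property values, and `len(geons)` reproduces the figure quoted in the text.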

A second class of theories claims that human vision defines parts not by 
basic shapes but by rules which specify the boundaries between one part and its 
neighbors. The minima rule is one such theory. Another is the theory of “limbs” 
and “necks” developed by Siddiqi and Kimia [27], building on their earlier work 
[28]. They define a limb as “a part-line going through a pair of negative curvature
minima with co-circular boundary tangents on (at least) one side of the part- 
line” ([27], p. 243). Their “part-line” is what I call a “part cut.” Two tangents 
are “co-circular” if and only if they are both tangent to the same circle ([29], 
p. 829). Siddiqi and Kimia define a neck as “a part-line which is also a local 
minimum of the diameter of an inscribed circle” ([27], p. 243). 



4 Naming Parts 



Figure 7a shows a peen. After looking at the figure, you know that a peen is the 
part of a hammer that is shaded gray in Figure 7c. You also are sure that a peen 
is not the part shaded gray in Figure 7b. This exercise in ostensive definition is 
easy for us; we discover the meaning of peen quickly and without conscious effort. 
Yet in several respects our performance is striking. When we view the hammer 
in Figure 7a and guess the meaning of peen, we discard countless possibilities 
since, as is hinted by Figure 7b, there are countless ways to partition the hammer 
or any other shape. But despite these countless possibilities we all pick the same 
part. Deductive logic alone does not compel us to choose a unique part from 
the figure. Our choice must be constrained by rules in addition to the rules of 
deductive logic. In principle many different rules can yield a unique choice. But 
since we all pick the same part, it is likely we all use the same rules. What rules 
then do we use that make us all pick the same part when someone points and 
names? 





Figure 7.



We propose the following hypothesis: 



Minima-Part Bias. When human subjects learn, by ostensive definition,
names for parts of objects, they are biased to select parts defined by the
minima and short-cut rules.

The motivation for this hypothesis is that the units of representation computed
by the visual system and delivered to higher cognitive processes are natural
candidates to be named by the language system. I propose that parts defined by
the minima rule are such units. As an example, in Figure 8a is a curve, which
is ambiguous as to which side is figure and which is ground. If we take the
ground to be on the left, as in Figure 8b, then the minima rule gives us the part
boundaries that are indicated by the short line segments. As you can see in Figure
8b, each part boundary lies in a region of the curve that is concave with respect
to the figure (or, equivalently, convex with respect to the ground), and each part
boundary passes through the point with highest magnitude of curvature within
its region. If we take the ground to be on the right, as in Figure 8c, then the
minima rule gives us a completely different set of part boundaries, indicated
by the new set of short line segments. The reason is that switching figure and
ground also switches what is convex and concave, and the minima rule says to
use only the concave cusps and concave minima of curvature as part boundaries.





We can use the curve in Figure 8a to create the well-known face goblet illusion, 
as shown in Figure 9a. 




One can see this either as a goblet in the middle or two faces on the sides. If we 
see the faces as figure, then the minima rule gives the part boundaries shown on 
the right side of Figure 9b. These part boundaries divide the face, from top to 
bottom, into a forehead, nose, upper lip, lower lip, and chin, as labeled in the 
figure. If instead we see the goblet as figure, then the minima rule gives the part 
boundaries shown on the left side of Figure 9b. These part boundaries divide the 
goblet, from top to bottom, into a lip, bowl, stem (with three parts), and base.
Note that the parts defined by the minima rule are the parts we name. Other 
parts are not named. For instance, take the bowl part of the goblet and ask what 
we would call this same part on the face (where it is not a part defined by the 
minima rule). We would say it is “the lower part of the forehead and the upper
part of the nose.” This is not a name but a complex description of an unnamed 
part of the face. Thus the parts that we name on the face and the goblet are 
precisely the parts derived from the minima rule. 

The outline shape of an object can play an important role in its recognition 
across depth rotations [30]. To test the minima-part bias with such outline
shapes, Rodriguez, Nilson, Singh and Hoffman generated five random silhouettes
each having five minima parts [31]. On each trial the observer saw one silhou- 
ette with an arrow pointing to it. Orientations of the silhouettes were changed 
from trial to trial. The observer was instructed that the arrow pointed to “a 
dax” on the shape. The syntax of count nouns was used in these instructions to 
direct the observer’s attention to shape rather than substance interpretations of 
“a dax” [32-35]. On each trial the observer was given three choices, displayed 
in random order, for the dax: (1) a minima part, (2) a maxima part, and (3) a 
convex part cut at inflections. The arrow was placed so that it pointed towards 
the inflection, in order to minimize possible biases due to the position of the 
arrow. They found that minima parts were chosen as the dax about 75% of the 
time, far more frequently than the maxima or inflections. 

To further control for possible biases of the arrow position, a second expe- 
riment eliminated the arrow. Instead observers were instructed that the shape 
had a “dax” near its top. On each trial the silhouette was rotated so that the 
3 parts of interest (minima, maxima, inflections) were near the top, but never 
precisely vertical. They again found that minima parts were chosen about 75% 
of the time. 

The minima-part bias makes a striking prediction. Suppose a subject can see 
a shape undergo reversals of figure and ground. Then the parts of that shape that 
the subject will name should also change each time figure and ground reverse. 
The reason is that the minima rule defines part boundaries only at concave 
regions of an object. Thus when figure and ground reverse so also do concave and 
convex, so that subjects should see a new set of part boundaries. This prediction 
was illustrated in Figures 8 and 9. Rodriguez et al. tested this prediction of
the minima-part bias, using a simple method to induce a figure-ground reversal:
global reversal of contrast together with enclosure [31]. Subjects viewed a shape,
with an arrow pointing toward a “dax” on the shape, and picked the part that
looked most natural to be the “dax”. On a different trial they saw precisely the
same shape, with the arrow pointing in exactly the same way, but with reversed
contrast. This reversed contrast induced subjects to reverse figure and ground.
In this case the minima-part bias predicted that they would pick a different
part, one with the new minima of curvature for its part boundaries, as the most
natural “dax”. The shapes used were five random curves each having three or
four minima parts. Observers chose between minima, maxima, and inflection
options as before. The results were again as predicted by the minima-rule bias. 

Rodriguez et al. also devised a test in which the predictions of the geon
theory differ dramatically from those of the minima-rule bias [31]. They used a 
two-alternative forced-choice paradigm in which subjects saw two objects side 
by side and were told that one of the objects had a dax on it. Subjects had to 
choose which of the two objects had the dax on it. For most pairs of objects that 
were shown to the subjects, the minima rule and geon theory predicted opposite 
choices. Each of the objects was composed of two shapes. The first, the base 
shape, was either an elongated box or a cylinder. The second was one of twelve 
shapes, six of which were geons and six of which were nongeons. The geons were 
(1) a curved box or cylinder, (2) a tapered box or cylinder, and (3) a curved 
and tapered box or cylinder. The six nongeons were created from the geons by 
smoothly changing the cross section from square to circular or vice versa as 
it sweeps along the axis. Each of these twelve shapes was attached to its base
shape in one of two ways: (1) with a minima part boundary at the attachment 
or (2) with no minima boundary at the attachment. This led to a total of 18 
composite objects. Thus there were four types of objects, defined by the type 
and attachment of the second shape. These were (1) geons with minima, (2) 
geons without minima, (3) nongeons with minima, and (4) nongeons without 
minima. We can label these, respectively, +G+M, +G-M, -G+M, and -G-M. In
a two-alternative forced-choice paradigm there are $\binom{4}{2} = 6$ ways of presenting
pairs of these objects. These ways are listed below, together with the predictions 
of the geon theory and the minima rule as to which of the two objects is most 
likely to be chosen as the one having the dax. 



CASE   OBJECT 1   OBJECT 2   GEON       MINIMA

1      +G+M       +G-M       same       object 1
2      +G+M       -G+M       object 1   same
3      +G+M       -G-M       object 1   object 1
4      +G-M       -G+M       object 1   object 2
5      +G-M       -G-M       object 1   same
6      -G+M       -G-M       same       object 1



As you can see from this table, in five of the six cases the minima rule and the 
geon theory make different predictions. Only in the third case do their predictions 
agree. Rodriguez et al. used all cases in a two-alternative forced-choice paradigm
to test which theory correctly predicts observers’ choices. They found that where 
the predictions disagreed, each subject chose overwhelmingly in accord with the 
minima bias and not in accord with the geon theory. 
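
The six cases and their predictions follow mechanically from the four object types. The sketch below enumerates them, assuming (as the table suggests) that each theory favours whichever object carries its own feature and predicts "same" when the two objects do not differ on that feature. The names `TYPES`, `predict` and `prediction_table` are illustrative and not taken from the study.

```python
from itertools import combinations

# Object types as (is_geon, has_minima_boundary), labelled as in the text.
TYPES = {
    "+G+M": (True, True),
    "+G-M": (True, False),
    "-G+M": (False, True),
    "-G-M": (False, False),
}

def predict(flag1, flag2):
    """Which object a theory favours, given its relevant feature on each object."""
    if flag1 == flag2:
        return "same"
    return "object 1" if flag1 else "object 2"

def prediction_table():
    """Rows of (case, object 1, object 2, geon prediction, minima prediction)."""
    rows = []
    for case, (t1, t2) in enumerate(combinations(TYPES, 2), start=1):
        geon = predict(TYPES[t1][0], TYPES[t2][0])
        minima = predict(TYPES[t1][1], TYPES[t2][1])
        rows.append((case, t1, t2, geon, minima))
    return rows

if __name__ == "__main__":
    for row in prediction_table():
        print(*row, sep="\t")
```

Under this assumption the two theories agree only when both features differ in the same direction, which happens in a single pairing.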

In a final experiment, Rodriguez et al. found that subjects also use the 
minima-part bias in generalizing the names of parts. On each trial they sho- 
wed a subject an object with a single attached part, and told the subject that 
the object had a “dax” on it. Then they showed the subject that same object, 
but with the part transformed either by translation, scaling, or both translation 
and scaling. On some trials a minima part was transformed, on others a maxima
part, and on others an inflection part. Subjects had to decide if this new object 
also had a “dax” on it. Subjects did generalize the part name to transformed 
minima parts, but not to transformed maxima or inflection parts. This indicates 
that the minima-rule bias guides both the initial attachment of names to parts, 
and the generalization of part names to new parts. 

5 Conclusion 

Visual form is not available at the retina, but must be constructed by tens of bil- 
lions of neurons in the visual system. The description of visual form requires the 
visual system to carve the visual world into objects, and to carve these objects 
even more finely into parts. Human vision apparently does this by construc- 
ting minima-rule part boundaries early in the course of visual processing. These 
boundaries, together with geometric rules for constructing part cuts which use 
these boundaries, lead to an articulation of objects into parts. The potential
shapes of these parts are countless, and are not limited to a preordained set such 
as geons or generalized cylinders. These parts are among the fundamental units 
of visual description that are delivered to higher cognitive processes, including 
language. As a result, in language acquisition humans employ the minima-part 
bias when learning the names of parts by ostensive definition. The minima-part 
bias has so far only been tested in adults. It will be of interest to see if young 
children also employ this bias when learning names for parts. 

Abstract. An experimental comparative study of three matching methods
for the recognition of 3D objects from a 2D view is carried out. The
methods include graph matching, geometric hashing and the alignment 
technique. The same source of information is made available to each 
method to ensure that the comparison is meaningful. The experiments 
are designed to measure the performance of the methods in different 
imaging conditions. We show that matching by geometric hashing and 
alignment is very sensitive to clutter and measurement errors. Thus in 
realistic scenarios graph matching is superior to the other methods in 
terms of both recognition accuracy and computational complexity. 



1 Introduction 

Object recognition is one of the crucial tasks in computer vision. A practical 
solution to this problem would have numerous applications and would greatly 
impact on the field of intelligent robotics. In this paper we are concerned with 
the problem of finding instances of 3D objects using a single 2D image of the 
scene. A frontal image of each object is used as the object model. 

In model-based object recognition there are two major interrelated problems,
namely that of object representation and the closely related problem of object 
matching. A number of representation techniques have been proposed in the 
computer vision literature which can be broadly classified into two categories: 
feature based, and holistic (appearance based). We shall not be dismissive of the 
appearance based approaches as they possess positive merits and no doubt can 
play a complementary role in object recognition. However, our motivation for 
focusing on feature based techniques is their natural propensity to cope better 
with occlusion and local distortion. 

The matching process endeavours to establish the correspondence between the 
features of an observed object and of a hypothesised model. This invariably 
involves the determination of the object pose. The various object recognition 
techniques proposed in the literature differ in the way the models are invoked 
and verified. The techniques range from the alignment methods 13 13 where 
the hypothesised interpretation of image data and the viewing transformation is 
based on the correspondence of a minimal set of features. The candidate interpretation
and pose are then verified using other image and model features. The
other end of the methodological spectrum is occupied by geometric hashingjTj 
0 or hough transform methods dSI where all the scene features are used jointly 
to index into a model database. This approach is likely to require a smaller 
number of hypothesis verifications. However its success is largely dependent on 
the ability to extract distinctive features reliably. The verification stage in
both alignment and geometric hashing methods involves finding a global
transformation between scene object and model.

In contrast, the philosophy behind graph matching methods is to focus on local
consistency of interpretation PI21. In an earlier work [3] we developed a
recognition method in which test and model image regions are represented in 
the form of relational attributed graphs. The graph matching is accomplished 
by the probabilistic relaxation labelling method UK which has been adapted for 
this application. 

The aim of this paper is to carry out an extensive experimental comparative study
of the geometric hashing, alignment and graph matching methods. The same 
source of information is made available to each method to ensure that the com- 
parison is meaningful. The experiments are designed to measure the performance 
of the methods in different imaging conditions. The experimental results show 
that matching by geometric hashing and alignment is very sensitive to clutter 
and measurement errors. We shall argue that the success of the graph matching 
approach stems from the fact that the model and scene images are compared by 
considering local matches. This prevents the propagation of errors throughout 
the image. In contrast, error propagation plagues the alignment and geometric 
hashing methods which perform the scene/image model comparison in a global 
coordinate system. Thus we shall show that, in realistic scenarios, graph match- 
ing is superior to the other methods in terms of both recognition accuracy and 
computational complexity. 

We begin by reviewing the methods to be compared. In section 01 we describe 
the experiments designed for the comparison of the methods. The results of the 
experiments are reported in section 0 Finally we draw the paper to conclusion 
in the last section. 

2 Methodology 

In order to make a comparative study of matching approaches meaningful it is 
essential that all methods use the same information for the object comparison. 
Thus in our study all methods deploy identical object representation. In this 
regard we consider an object image, which either serves as an object model or 
captures a scene containing unknown object(s) to be interpreted, as a collection 
of homogeneous planar regions obtained by segmenting the input image. In the 
former case we refer to it as the model image whereas in the latter we call it a test 
image. In our representation each region of the image is described using region 
colour (in the YUV system), region area, region centroid and a number of high 
curvature points extracted from the region boundary. Furthermore the boundary
of each region is used to characterise its shape. In the following subsections we 
explain how each recognition method employs the given information. 

2.1 Geometric Hashing and Alignment 

In geometric hashing and alignment the recognition task is considered as the 
problem of finding the best transformation which matches object features in the 
test image to the corresponding features in the object model. The transformation 
parameters are computed using a minimum number of corresponding features in 
the model and test image planes. Assuming the transformation is affine, it can be 
determined from the knowledge of three corresponding points in the respective 
test and model image planes. The two methods differ in the way the model-test 
triplets are selected to generate transformation hypotheses. 
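
As a minimal sketch of the step just described, the following recovers an affine map from one model triplet and one test triplet. It assumes 2D points given as `(x, y)` tuples and the convention `test = A * model + t`; the function names are illustrative and are not part of either method as published.

```python
def affine_from_triplets(model, test):
    """Return (A, t) such that A * p + t maps each model point onto its test point.

    model, test: sequences of three (x, y) points in corresponding order.
    """
    (p0, p1, p2), (q0, q1, q2) = model, test
    # Edge vectors of the model triangle.
    ux, uy = p1[0] - p0[0], p1[1] - p0[1]
    vx, vy = p2[0] - p0[0], p2[1] - p0[1]
    det = ux * vy - vx * uy
    if abs(det) < 1e-12:
        raise ValueError("model triplet is collinear")
    # Inverse of the 2x2 matrix whose columns are the model edge vectors.
    inv = [[vy / det, -vx / det], [-uy / det, ux / det]]
    # Edge vectors of the test triangle.
    sx, sy = q1[0] - q0[0], q1[1] - q0[1]
    wx, wy = q2[0] - q0[0], q2[1] - q0[1]
    # A = [test edges] * [model edges]^-1
    a = [
        [sx * inv[0][0] + wx * inv[1][0], sx * inv[0][1] + wx * inv[1][1]],
        [sy * inv[0][0] + wy * inv[1][0], sy * inv[0][1] + wy * inv[1][1]],
    ]
    # Translation chosen so that p0 lands exactly on q0.
    t = (q0[0] - (a[0][0] * p0[0] + a[0][1] * p0[1]),
         q0[1] - (a[1][0] * p0[0] + a[1][1] * p0[1]))
    return a, t

def apply_affine(a, t, p):
    """Apply the affine map (A, t) to a single point p."""
    return (a[0][0] * p[0] + a[0][1] * p[1] + t[0],
            a[1][0] * p[0] + a[1][1] * p[1] + t[1])
```

Each model-test triplet pair yields one such hypothesis; the remaining features are then mapped with `apply_affine` and compared during verification.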

Alignment Method: In the alignment method the test image is aligned against 
a candidate model image using all the possible pairs of model-test triplets. Each 
of these pairs defines a transformation between the two planes associated with 
the triplets [2]. The validity of each generated hypothesis is initially assessed at a
coarse level by quantifying the goodness of match between a small set of features 
of the candidate model and the test image. A large number of hypotheses will be 
eliminated at this stage. In the fine verification stage the transformation, which 
exhibits the highest degree of match between all the model image features and 
the test image features, is considered as the solution. 

It is apparent that because of the potentially large number of possible combina- 
tions of test and model image triplets the process is very time consuming. We 
take advantage of our region-based representation to reduce the number of can- 
didate triplets. For this purpose, we consider only those combinations of interest 
points which are associated with the same region. A further reduction of the 
number of model-test triplets is achieved by filtering out those pairs belonging 
to regions with considerable colour difference. 

In the coarse verification stage we measure the average Euclidean distance be- 
tween the transformed centroid coordinates of the corresponding model and test 
image regions. Furthermore, the number of region correspondences is registered 
as another matching criterion. 

We consider a pair of model-test image regions to match if the differences in the 
colour and area measurements and also in the centroid coordinates of the regions 
fall within a predetermined threshold. Note that for this evaluation we map the 
model features to the test image. Any candidate transformation which provides 
a matching distance below the predetermined threshold, with the number of cor- 
responding regions exceeding the required minimum, passes this pruning stage. 
In the fine verification stage for each hypothesised transformation we measure 
the average Euclidean distance between the boundary samples of a transformed 
model region and the corresponding test image region. This measurement is 
made by searching for the minimum distance between the two strings of sam- 
ples. The search involves considering all the possible shifts between the strings. 
The transformation between the test image and a candidate model, which yields
a sufficiently high number of matches, will be accepted as a solution; otherwise 
the procedure continues by selecting another triplet from the test image. In case 
no such transformation between the test image and a candidate model is found 
other models would be considered in turn. 
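
The fine verification measure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: both boundaries are taken to be sampled with the same number of points and traversed in the same direction, so that only the starting point (the cyclic shift) is unknown. The name `boundary_distance` is hypothetical.

```python
import math

def boundary_distance(model_pts, test_pts):
    """Smallest average Euclidean distance between two boundary sample strings,
    minimised over all cyclic shifts of the test string."""
    n = len(model_pts)
    if n == 0 or n != len(test_pts):
        raise ValueError("both boundaries need the same non-zero number of samples")
    best = math.inf
    for shift in range(n):
        total = sum(
            math.dist(model_pts[i], test_pts[(i + shift) % n]) for i in range(n)
        )
        best = min(best, total / n)
    return best
```

A region pair would then count as a match when this value falls below a chosen threshold. Trying every shift costs a number of distance evaluations quadratic in the number of samples per region pair.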



Geometric Hashing Method: The geometric hashing paradigm is based on an
intensive off-line model processing stage, where model information is indexed 
into a hash-table using a minimum number of transformation invariant features 
referred to as a basis. In the recognition phase the features in the test image 
are expressed in a similar manner using bases picked from the test image. The 
comparison of the invariant coordinates of model and test feature points in the 
hash table leads to finding the best pair of bases which provides the maximum 
number of matches between the model and test feature points. 

Let us consider the geometric hashing method^ in more detail. Taking the three 
points $e_{00}$, $e_{01}$ and $e_{10}$ as a basis, the affine coordinates $(\xi, \eta)$ of an arbitrary
point $v$ can be expressed using the following formula:

$$v = \xi (e_{10} - e_{00}) + \eta (e_{01} - e_{00}) + e_{00}$$

One of the important features of such a coordinate system is its invariance to 
affine transformations. Consider an arbitrary point and a basis from the same 
plane which are transformed by an affine transformation. The following expres- 
sion shows that the affine coordinate of the point remains invariant: 

$$Tv = \xi (Te_{10} - Te_{00}) + \eta (Te_{01} - Te_{00}) + Te_{00}$$
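
The invariance can be checked numerically. The sketch below solves the 2x2 linear system for the two affine coordinates by Cramer's rule and compares the coordinates before and after an arbitrary affine map. The function names and the example transformation are illustrative assumptions.

```python
def affine_coords(v, e00, e10, e01):
    """Affine coordinates (xi, eta) of v in the basis (e00, e10, e01), i.e. the
    solution of v = xi*(e10 - e00) + eta*(e01 - e00) + e00.
    Returns None when the basis is (numerically) collinear."""
    ax, ay = e10[0] - e00[0], e10[1] - e00[1]
    bx, by = e01[0] - e00[0], e01[1] - e00[1]
    det = ax * by - bx * ay
    if abs(det) < 1e-12:
        return None
    dx, dy = v[0] - e00[0], v[1] - e00[1]
    xi = (dx * by - bx * dy) / det    # Cramer's rule, first unknown
    eta = (ax * dy - dx * ay) / det   # Cramer's rule, second unknown
    return xi, eta

def example_affine(p):
    """An arbitrary non-singular affine map used only for the demonstration."""
    return (2.0 * p[0] + 1.0 * p[1] + 5.0, 0.5 * p[0] + 3.0 * p[1] - 1.0)

if __name__ == "__main__":
    basis = [(1.0, 1.0), (3.0, 1.0), (1.0, 4.0)]
    v = (2.0, 2.5)
    before = affine_coords(v, *basis)
    after = affine_coords(example_affine(v), *[example_affine(e) for e in basis])
    print(before, after)  # the two pairs agree up to rounding error
```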

Using this property the model information can be represented in an invariant 
form. For each model image and for each ordered non-collinear triple of the 
feature points the coordinates of all other model points are computed taking 
this triplet as an affine basis of the 2D plane. Each such coordinate is used as 
an entry to a hash table, where the identity of the basis in which the coordinate 
was obtained and the identity of the model (in case of more than one model) 
are recorded. In the recognition phase the given test image is represented in 
the same way. An arbitrary triplet of test image feature points is chosen to 
define a basis and affine coordinates of all the other scene interest points are 
computed based on this basis. For each coordinate the appropriate entry of 
the hash-table is checked and a vote for the basis and the model identity in the 
corresponding entry is noted. After the voting process the hash table is inspected 
to find the model-bases with a large number of votes. Such model-bases define 
hypothesised transformations between the test image and a specific model image. 
In the verification stage the model features are verified against the test image 
features by applying each of the candidate transformations. If the verification 
fails the algorithm will continue by picking another triplet from the test image. 
We implemented a recognition system based on the above geometric hashing 
algorithm. In this system we apply the same pruning strategy as in the alignment 
method to reduce the number of image bases. The verification of a hypothesised
transformation is also carried out in the same way as in the fine verification stage 
of the alignment method. 
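
The off-line indexing and on-line voting stages can be sketched compactly. This is a simplified illustration, not the system evaluated here: it quantises affine coordinates into square bins of an assumed size `bin_size`, omits the region-based pruning and the verification stage, and uses hypothetical names throughout.

```python
from collections import Counter, defaultdict
from itertools import permutations

def affine_coords(v, e00, e10, e01):
    """Affine coordinates of v in the basis (e00, e10, e01); None if collinear."""
    ax, ay = e10[0] - e00[0], e10[1] - e00[1]
    bx, by = e01[0] - e00[0], e01[1] - e00[1]
    det = ax * by - bx * ay
    if abs(det) < 1e-12:
        return None
    dx, dy = v[0] - e00[0], v[1] - e00[1]
    return (dx * by - bx * dy) / det, (ax * dy - dx * ay) / det

def quantise(coords, bin_size):
    """Hash key: index of the bin that contains the affine coordinates."""
    return (round(coords[0] / bin_size), round(coords[1] / bin_size))

def build_table(models, bin_size=0.25):
    """models: dict name -> list of (x, y) feature points.
    Every ordered non-collinear triple serves once as a basis."""
    table = defaultdict(list)
    for name, pts in models.items():
        for basis in permutations(range(len(pts)), 3):
            e00, e10, e01 = (pts[i] for i in basis)
            for j, v in enumerate(pts):
                if j in basis:
                    continue
                coords = affine_coords(v, e00, e10, e01)
                if coords is not None:
                    table[quantise(coords, bin_size)].append((name, basis))
    return table

def vote(table, test_pts, test_basis, bin_size=0.25):
    """Count votes for (model, basis) pairs given one basis chosen in the test image."""
    e00, e10, e01 = (test_pts[i] for i in test_basis)
    votes = Counter()
    for j, v in enumerate(test_pts):
        if j in test_basis:
            continue
        coords = affine_coords(v, e00, e10, e01)
        if coords is not None:
            votes.update(table.get(quantise(coords, bin_size), []))
    return votes
```

The (model, basis) pairs with the most votes become the transformation hypotheses passed on to verification. The bin size trades tolerance to measurement error against the number of accidental collisions.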



2.2 Attributed Relational Graph Matching 

In the graph-based method 0 image information is represented in the form of 
an attributed relational graph. In particular all the information associated with 
the object models is represented in a model graph and test image information 
is represented in a test graph. In these graphs each node characterises a region 
described in terms of measurements referred to as unary measurements. The 
affine invariant representation is achieved by presenting the regions in their nor- 
malised form. Furthermore, we measure the relation between a pair of regions 
using binary measurements and represent them as links between graph nodes. 
In the recognition phase we match the test graph against the model graph using 
the probabilistic relaxation labelling technique. In the following, we explain the 
adopted representation and the recognition process in more detail. 



Graph Representation: We start by addressing the problem of finding an
affine transform that maps the region R to region r in a normalised space where 
the transformed regions should be comparable. 

In order to define such a transform uniquely we impose the following constraints 
on it: 

1. the reference points $(x_0, y_0)$ and $(x_1, y_1)$ of the region $R$ are to be mapped
to points (0,0) and (1,0) of $r$ respectively.

2. the normalised region $r$ is to have a unit area and the second order cross
moment equal to zero.

To simplify the transformation task we split it into two sub-tasks. First, using a
similarity transformation matrix $T_s$, the reference points $(x_0, y_0)$ and $(x_1, y_1)$ of
$R$ are mapped to points (0, 0) and (1, 0) in the new region $R_0$ respectively. Points
are treated as homogeneous row vectors $[x \; y \; 1]$ multiplied on the right by the
matrix. The matrix $T_s$ can readily be shown to be:

$$T_s = \frac{1}{k_n} \begin{pmatrix} x_1 - x_0 & y_0 - y_1 & 0 \\ y_1 - y_0 & x_1 - x_0 & 0 \\ x_0^2 + y_0^2 - y_0 y_1 - x_0 x_1 & x_0 y_1 - y_0 x_1 & k_n \end{pmatrix} \qquad (1)$$

where $k_n = (x_1 - x_0)^2 + (y_1 - y_0)^2$ is the squared distance between the two reference points.
Second, we determine the affine transform, $T_a$, that maps the new region,
$R_0$, to the normalised region, $r$. Such a matrix can be calculated taking into
account the relations between the second order moments of $R_0$ and $r$:

$$T_a = \begin{pmatrix} 1 & 0 & 0 \\ -u_{1,1}/u_{0,2} & k & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (2)$$

where $u_{1,1}$ and $u_{0,2}$ are the second order moments of the region $R_0$ and
$k = k_n / (\text{Area of region } R)$. The transformation matrix that maps region $R$
to the normalised region, $r$, will then be given as $T = T_s T_a$.
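
The two-step normalisation can be sketched as below. Several points are assumptions of this illustration rather than statements from the paper: the region is given as a list of pixel coordinates, its area is taken to be the pixel count, the second order moments are computed as plain sums about the origin of the intermediate region, and the first reference point is the one sent to the origin. The name `normalising_transform` is hypothetical.

```python
def normalising_transform(pixels, ref0, ref1):
    """Build the map R -> r as a function, plus the determinant of its linear part.

    ref0 is sent to (0, 0) and ref1 to (1, 0). The normalised region has a zero
    second order cross moment and, for unit-area pixels, unit area.
    """
    (x0, y0), (x1, y1) = ref0, ref1
    kn = (x1 - x0) ** 2 + (y1 - y0) ** 2  # squared distance of the reference points
    if kn == 0:
        raise ValueError("reference points must be distinct")

    def similarity(p):
        # Step 1: translate, rotate and scale so that ref0 -> (0,0), ref1 -> (1,0).
        dx, dy = p[0] - x0, p[1] - y0
        return ((dx * (x1 - x0) + dy * (y1 - y0)) / kn,
                (dx * (y0 - y1) + dy * (x1 - x0)) / kn)

    intermediate = [similarity(p) for p in pixels]
    u11 = sum(x * y for x, y in intermediate)  # cross moment of the intermediate region
    u02 = sum(y * y for _, y in intermediate)  # second moment in y
    if u02 == 0:
        raise ValueError("degenerate region: all points lie on the reference axis")
    shear = u11 / u02
    k = kn / float(len(pixels))  # assumption: area of R equals its pixel count

    def transform(p):
        # Step 2: shear to cancel the cross moment, scale y to reach unit area.
        x, y = similarity(p)
        return (x - shear * y, k * y)

    return transform, k / kn
```

Both steps leave the two reference points fixed at (0, 0) and (1, 0), which is why the shear acts along the x axis and the scaling acts along y only.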

Let us define $B = T^{-1}$. The affine invariant coordinates of an arbitrary point $P$
of region $R$ can be defined as:

$$C_B(P) = P B^{-1} \qquad (3)$$

It has been shown that matrix $B_{ij} = B_i B_j^{-1}$, associated with a pair of regions
Ri and Rj, is an affine invariant measurement | 0 ] • We refer to this matrix as the 
binary relation matrix. 

In order to normalise a region we consider its centroid as one of the required 
reference points. The way the second reference point is selected is different for 
the test image and the model. In the case of the model the highest curvature 
point on the boundary of the region is chosen as the second reference point, while 
in the test image for each region a number of points of high curvature are picked 
and consequently a number of representations for each test region are provided. 
The selection of more than one point on the boundary is motivated by the fact 
that an affine transformation may change the ranking and distort the position 
of high curvature points. 

We represent all of the model images in a common graph and refer to it as
the model graph. Suppose that the model graph contains $M$ nodes (normalised
regions). Then $\Omega = \{\omega_1, \omega_2, \ldots, \omega_M\}$ denotes the set of labels for the nodes
which define their identity. Each node $\omega_i$ is characterised by a measurement
vector $\check{x}_i$; $\check{X} = \{\check{x}_1, \check{x}_2, \ldots, \check{x}_M\}$ is the set of these unary measurement vectors.
Let $N_i$ be the index set of nodes neighbouring node $i$. The set of relational
measurements between pairs of graph nodes is denoted
$\check{A} = \{\check{A}_{ij} \mid i \in \{1, \ldots, M\},\ j \in N_i\}$. As a unary attribute vector for node $\omega_i$ we take the vector
$\check{x}_i = [S, C]$, where the components of vector $S$ are the coordinates of a
number of equally spaced samples on the boundary of the $i$th normalised region.
Vector $C$ is a representative of the region colour. The binary measurement vector
$\check{A}_{ij}$ associated with the pair of nodes $\omega_i$ and $\omega_j$ is defined as follows:

$$\check{A}_{ij} = [\, b_{ij}^{mn}\ (m = 1, 2;\ n = 1, 2),\ \mathrm{ColorRelation}_{ij}^{T},\ \mathrm{AreaRelation}_{ij} \,]$$

where $b_{ij}^{mn}$ is the $(m, n)$th element of the binary relation matrix $B_{ij}$ associated
with the region pair. In this vector, the vector $\mathrm{ColorRelation}_{ij}$ and the scalar $\mathrm{AreaRelation}_{ij}$
express the colour relations and area ratios of the region pair respectively.
Similarly, we capture the test image information in a test graph where
$a = \{a_1, a_2, \ldots, a_N\}$ is the set of test graph nodes and $X$ and $A$ denote the sets of
unary and binary measurements respectively. The identity of each graph node,
$a_i$, is denoted by $\theta_i$. Recall that the representation of the test image differs from
that of the model in the sense that more than one basis is provided for each
test region. The multiple representation for each test node is defined in terms of
a set of unary measurement vectors, $x_i^k$, with index $k$ indicating that the vector
is associated with the $k$th representation of the $i$th node. To each neighbouring
pair of regions, $a_i$, $a_j$, we associate binary relation vectors $A_{ij}^{kl}$ with semantically
identical components to those of the model binary vectors. The multiple
unary measurement vectors $x_i^k$ and binary relations $A_{ij}^{kl}$ constitute the combined
unary and binary relation representation $x_i = \{x_i^k \mid k \in \{1, \ldots, L\}\}$ and
$A_{ij} = \{A_{ij}^{kl} \mid k, l \in \{1, \ldots, L\}\}$ where $L$ denotes the number of representations
used for the test regions.



Graph Matching: For matching we have adopted the relaxation labelling
technique of HU and adapted it to our task. The problem considered here is 
much more complex than in previous applications of relaxation methods due to 
a large number of nodes in the model graph. Similarly to HU we add a null label 
to the label set to reduce the probability of incorrect labelling but we try to 
neutralise the support of neighbouring objects when they take the null label. As 
we discuss later this allows us better to cope with the matching problem. 

We divide the matching process into two stages: first, finding the best represen- 
tation of object under a particular label hypothesis and second, updating the 
label probabilities by incorporating contextual information. In the first stage, we 
compare the unary attribute measurements of each object, $a_i$, of the test image
with the same measurements for all its admissible interpretations and construct
a list, $L_i$, containing the labels which can be assigned to the object. Simultaneously,
for each label in this list we find the best representation. For assessing each
representation we measure the mean square distance between the normalised
region boundary points stored in a vector, $S$, and the unary attribute vector of
the hypothesised label. In other words the merit of the $k$th representation of object
$a_i$ in the context of the assignment of label $\omega_\alpha$ to that object is evaluated as:

$$E(\theta_i = \omega_\alpha) = \min_k \left( x_i^k[S] - \check{x}_\alpha[S] \right)^2$$

In ideal conditions, for the correct basis, the above measurement for the corre- 
sponding regions would be zero. However, due to errors in the extraction of the 
reference points and as a result of the segmentation noise affecting the boundary 
pixel positions this measurement is subject to errors. Thus criterion function 
E is compared against a predefined threshold. A label is entered in the list of 
hypotheses only if the measurement value is less than the threshold. The best 
representation basis, k, is also recorded in the node label list. At the end of this 
process we have a label list for each object with the best representation for each 
label in the list. Hence we do not need to distinguish between different represen- 
tations by superscript indices on the unary and binary vectors. Instead we index 
them with a star to indicate that the best representation is being considered. 
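
A minimal sketch of this first stage is given below, assuming each boundary vector is a flat list of sampled coordinates of equal length and that the admissible model labels are supplied as a dictionary. The names `best_representations`, `test_reprs` and `model_shapes` are illustrative.

```python
def best_representations(test_reprs, model_shapes, threshold):
    """Build the label list for one test region.

    test_reprs:   list of boundary vectors, one per candidate basis k.
    model_shapes: dict mapping a model label to its boundary vector.
    Returns a dict label -> (best k, error) for labels whose error is below threshold.
    """
    hypotheses = {}
    for label, model_vec in model_shapes.items():
        # Mean square distance between each representation and the model boundary.
        errors = [
            sum((s - m) ** 2 for s, m in zip(test_vec, model_vec)) / len(model_vec)
            for test_vec in test_reprs
        ]
        k = min(range(len(errors)), key=errors.__getitem__)
        if errors[k] < threshold:
            hypotheses[label] = (k, errors[k])
    return hypotheses
```

Only the retained (label, k) pairs take part in the second stage, which keeps the number of label hypotheses per object small.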

In the second stage, we consider the possible label assignments for each object,
$a_i$, and iteratively update the probabilities using their previous values and supports
provided by the neighbouring objects. In addition to the above pruning
measures, at the end of each iteration we eliminate the labels the related prob- 
abilities of which drop below a threshold value. This will make our relaxation 
method faster and more robust as well. Indeed the updating of probabilities of 
unlikely matches not only takes time but also increases the probability of incorrect
assignment due to increased entropy of the interpretation process which is
a function of not only probability distribution but also of the number of possible 
interpretations. 

Returning to our problem of assigning labels from label set $\Omega$ to the set of objects
in the test graph, let $P(\theta_i = \omega_{\theta_i})$ denote the probability of label $\omega_{\theta_i}$ being
the correct interpretation of object ai. Christmas et al^3 developed a theoret- 
ical underpinning of probabilistic relaxation using a Bayesian framework. They 
derived a formula which iteratively updates the probabilities of different labels 
an object can take using contextual information measured in a neighbourhood 
of the object. To overcome the problem of inexact matching a null label $\omega_0$ is
added to the label set for assigning to the objects for which no other labels are
appropriate.

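The probability that the object $a_i$ takes label $\omega_{\theta_i}$ is updated in the $(n+1)$st
step of the algorithm using the support provided from the neighbouring objects:

$$P^{(n+1)}(\theta_i = \omega_{\theta_i}) = \frac{P^{(n)}(\theta_i = \omega_{\theta_i})\, Q^{(n)}(\theta_i = \omega_{\theta_i})}{\sum_{\omega_\lambda \in \Omega} P^{(n)}(\theta_i = \omega_\lambda)\, Q^{(n)}(\theta_i = \omega_\lambda)} \qquad (4)$$

$$Q^{(n)}(\theta_i = \omega_\alpha) = \prod_{j \in N_i} \sum_{\omega_\beta \in \Omega} P^{(n)}(\theta_j = \omega_\beta)\, p(A^*_{ij} \mid \theta_i = \omega_\alpha, \theta_j = \omega_\beta) \qquad (5)$$

where function $Q$ quantifies the support that assignment $(\theta_i = \omega_\alpha)$ receives at
the $n$th iteration step from the neighbours of object $i$ in the test graph.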

The summation in the support function measures the average consistency of the
labelling of object $a_j$ (a neighbour of $a_i$) in the context of the assignment
$(\theta_i = \omega_{\theta_i})$ on object $a_i$. In each term of this summation the consistency
of a particular label on object $a_j$ is evaluated. The compatibility of the labels
is measured in terms of the value of the distribution function of the binary relation
measurements between $a_i$ and $a_j$. In [3] the consistency of the neighbouring
object $a_j$ is involved in the support computation even when it takes the null label.
As a result, if the consistency of the other labels on $a_j$ is low (which happens
frequently), the contribution to the support provided by the null label becomes
dominant and has an undesirable effect. Since the assignment of the null label to
$a_j$ does not provide any relevant information for the labelling of object $a_i$, its
support should be neutral. The undesirable behaviour of the null label was
confirmed in all our experiments, where it frequently manifested itself as incorrect
labelling. To avoid this problem we set the support term related to the null label
to a constant value. Because of the product operator in the definition of the support
function $Q$, this constant cannot be set to zero.
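The update rule and the neutral treatment of the null label can be illustrated with a minimal sketch. It assumes that label index 0 is the null label, that the compatibility densities are supplied as a precomputed array `compat`, and that the null-label term of each neighbour is replaced by a small constant `null_support`. The array layout, the function names and the value of the constant are illustrative assumptions, not details given in the text.

```python
import numpy as np

def support(p, compat, neighbours, i, null_support):
    """Support Q(theta_i = omega_alpha) for every label alpha of object i.

    p            : (n_objects, n_labels) label probabilities, column 0 = null label
    compat       : compat[i, j, alpha, beta] = p(A_ij | theta_i = alpha, theta_j = beta)
    neighbours   : dict, object index -> list of neighbouring object indices
    null_support : constant replacing the null-label term of a neighbour (assumed value)
    """
    q = np.ones(p.shape[1])
    for j in neighbours[i]:
        # Sum over the non-null labels beta of neighbour j.
        term = compat[i, j][:, 1:] @ p[j, 1:]
        # The null label on j contributes a fixed, non-zero constant instead,
        # so that a neighbour with no consistent label cannot zero the product.
        q *= term + null_support
    return q

def relaxation_step(p, compat, neighbours, null_support=1e-3):
    """One iteration: multiply by the support and renormalise over the labels."""
    new_p = np.empty_like(p)
    for i in range(p.shape[0]):
        unnormalised = p[i] * support(p, compat, neighbours, i, null_support)
        new_p[i] = unnormalised / unnormalised.sum()
    return new_p
```

With two mutually neighbouring objects and a compatibility array that is large only for one pair of labels, a single call to `relaxation_step` already shifts most of the probability mass onto that pair.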

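In the first step of the process the probabilities are initialised based on the unary
measurements. Denote by $p^{(0)}(\theta_i = \omega_{\theta_i})$ the initial label probabilities evaluated
using the unary attributes as:

$$p^{(0)}(\theta_i = \omega_{\theta_i}) = p(\theta_i = \omega_{\theta_i} \mid x_i^*) \tag{6}$$

Applying Bayes' theorem we have:

$$p^{(0)}(\theta_i = \omega_{\theta_i}) = \frac{p(x_i^* \mid \theta_i = \omega_{\theta_i})\, p(\theta_i = \omega_{\theta_i})}{\sum_{\omega_\alpha \in L_i} p(x_i^* \mid \theta_i = \omega_\alpha)\, p(\theta_i = \omega_\alpha)} \tag{7}$$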



with the normalisation carried out over the labels in the label list $L_i$. Let $\zeta$ be the
proportion of test nodes that will assume the null label. Then the prior label
probabilities are given by:

$$p(\theta_i = \omega_\lambda) = \begin{cases} \zeta & \lambda = 0 \ \text{(null label)} \\ \dfrac{1-\zeta}{M} & \lambda \neq 0 \end{cases} \tag{8}$$

where $M$ is the number of labels (model nodes).

We shall assume that the errors on the unary measurements are statistically
independent and that their distribution function is Gaussian, i.e.

$$p(x_i^* \mid \theta_i = \omega_\alpha) = \mathcal{N}_{x_i^*}(x_\alpha, \Sigma_u) \tag{9}$$

where $\Sigma_u$ is a diagonal covariance matrix for the measurement vector $x_i^*$. In the
support function $Q$, the term $p(A^*_{ij} \mid \theta_i = \omega_\alpha, \theta_j = \omega_\beta)$ plays the role of the
compatibility coefficient in other relaxation methods. In fact it is the density
function for the binary measurement $A^*_{ij}$ given the matches $\theta_i = \omega_\alpha$ and
$\theta_j = \omega_\beta$. Similarly, the distribution function of the binary relations is centred
on the model binary measurement $A_{\alpha\beta}$. It is assumed that deviations from this
mean are also modelled by a Gaussian. Thus we have:

$$p(A^*_{ij} \mid \theta_i = \omega_\alpha, \theta_j = \omega_\beta) = \mathcal{N}_{A^*_{ij}}(A_{\alpha\beta}, \Sigma_b) \tag{10}$$

where $\Sigma_b$ is the covariance matrix of the binary measurement vector $A^*_{ij}$.
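The initialisation can be sketched as follows: a Gaussian likelihood with diagonal covariance for each model label, the two-valued prior, and a Bayes normalisation. The text does not say how the likelihood of the null label is modelled, so the constant `null_density` below is purely an assumption, as are the function and parameter names.

```python
import numpy as np

def initial_probabilities(x, model_x, sigma_u, zeta, null_density):
    """Initial label probabilities for one test object.

    x            : (d,) unary measurement vector of the test object
    model_x      : (M, d) unary measurement vectors of the M model nodes
    sigma_u      : (d,) diagonal of the unary covariance matrix (variances)
    zeta         : prior proportion of test nodes that take the null label
    null_density : assumed likelihood of x under the null label (hypothetical)
    Returns a vector of length M + 1; entry 0 is the null label.
    """
    n_models = model_x.shape[0]
    # Prior: zeta for the null label, (1 - zeta) / M for every model label.
    prior = np.concatenate(([zeta], np.full(n_models, (1.0 - zeta) / n_models)))
    # Gaussian log-density with diagonal covariance, centred on each model vector.
    diff = x - model_x
    log_lik = -0.5 * np.sum(diff ** 2 / sigma_u + np.log(2.0 * np.pi * sigma_u), axis=1)
    likelihood = np.concatenate(([null_density], np.exp(log_lik)))
    # Bayes' theorem: posterior proportional to likelihood times prior.
    posterior = likelihood * prior
    return posterior / posterior.sum()
```

In this sketch the normalisation runs over all labels passed in; restricting `model_x` to the entries of an object's label list gives the per-object normalisation described above.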

The iterative process is terminated in either of the following circumstances:

1. In the last iteration none of the probabilities changed by more than a
threshold $\epsilon$.

2. The number of iterations reached a specified limit.
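The two stopping rules, together with the per-iteration pruning of improbable labels mentioned earlier, can be sketched as a generic driver loop. The function name, the default values and the `update` callback (any function mapping a probability array to the next one) are assumptions made for illustration.

```python
import numpy as np

def run_relaxation(p0, update, eps=1e-4, max_iter=100, prune_below=0.0):
    """Iterate `update` until the probabilities settle or the limit is reached.

    p0          : (n_objects, n_labels) initial probabilities, rows summing to 1
    update      : callable returning the next probability array
    eps         : stop when no probability changes by more than this amount
    max_iter    : stop after this many iterations
    prune_below : labels whose probability falls below this value are set to 0;
                  it should stay below 1 / n_labels so that no row is emptied
    Returns the final probabilities and the number of iterations performed.
    """
    p = np.asarray(p0, dtype=float)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_p = update(p)
        if prune_below > 0.0:
            # Eliminate unlikely labels and renormalise each object's row.
            new_p = np.where(new_p < prune_below, 0.0, new_p)
            new_p = new_p / new_p.sum(axis=1, keepdims=True)
        largest_change = np.max(np.abs(new_p - p))
        p = new_p
        if largest_change <= eps:  # first termination rule
            break
    return p, n_iter               # second rule: the loop ran out of iterations
```

Pruned labels stay at zero under a multiplicative update, which is what allows them to be skipped in later iterations.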



3 Experiments 

In this section we describe the experiments carried out in order to compare the
geometric hashing, alignment and graph-based methods. The aim of the experiments
is to compare the recognition methods from the point of view of recognition
performance and processing time. We test the matching methods on two different
databases: a groceries database and a traffic signs database. In each database a
number of images containing the objects of interest are used as the test images,
and a frontal image of each object is used as the object model.

In the first experiment we test the methods on the groceries database. The
database contains images of 24 objects taken under controlled conditions. Each
object is imaged from 20 different viewing angles ranging from -90 to +90
degrees. The set of 20 images for one of the objects is shown in Fig. 1. Fig. 2 shows
the object models for this database. Note that the resolution of the model images
is twice that of the test images. This difference in resolution is designed to
assess the ability of the methods to match objects under considerable scaling. The
recognition rate is measured separately for the viewing ranges of 45 degrees
(poses 9, 10 and 11) and 90 degrees (poses 6 to 15).

Fig. 1. The images of an object of the database taken from different viewing angles

The second experiment is designed to compare the performance of the matching
methods under severe clutter, considerable illumination changes and scaling. We
have chosen a traffic sign database for this purpose. It consists of sixteen outdoor
images of traffic signs, shown in Fig. 3, which are regarded as the test images. The
object models are shown in Fig. 4. The performance is measured in terms of
recognition rate.

In each experiment, in order to extract the required information, the test and
model images are first segmented using a region growing method. For each
segmented region, the colour, area, centroid coordinates and the coordinates of the
boundary points are extracted. The region feature points are obtained by measuring
the curvature along the boundary. The curvature points are recorded in
decreasing order of curvature. In the experiment on the traffic signs database we
extract seven interest points for each region. The experiment has shown that even
in the case of a severe affine transformation this number of features is sufficient to
obtain a model-test triplet match between the corresponding regions. In any case,
to find out how the number of feature points affects the performance of the
individual methods, we have experimented with different values of this parameter
on the groceries database.
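The text does not specify how curvature is measured, so the following is only one plausible sketch: curvature is approximated by the turning angle between the chords reaching `step` samples back and forward along a closed boundary, and points are picked in decreasing order of that angle. The `step` and `min_sep` parameters, and the suppression of near-duplicate points, are assumptions added for the sketch.

```python
import numpy as np

def curvature_feature_points(boundary, n_points=7, step=3, min_sep=3):
    """Indices of high-curvature boundary points, in decreasing order of curvature.

    boundary : (N, 2) ordered points of a closed contour
    n_points : number of feature points to return (seven in the traffic sign test)
    step     : chord length, in samples, used for the turning-angle estimate
    min_sep  : minimum index distance between two selected points
    """
    b = np.asarray(boundary, dtype=float)
    back = b - np.roll(b, step, axis=0)       # chord arriving at each point
    ahead = np.roll(b, -step, axis=0) - b     # chord leaving each point
    cross = back[:, 0] * ahead[:, 1] - back[:, 1] * ahead[:, 0]
    dot = np.sum(back * ahead, axis=1)
    turning = np.abs(np.arctan2(cross, dot))  # absolute turning angle in radians
    order = np.argsort(-turning, kind="stable")
    picked = []
    n = len(b)
    for idx in order:
        # Skip points that lie too close (circularly) to one already selected.
        if all(min(abs(idx - j), n - abs(idx - j)) >= min_sep for j in picked):
            picked.append(int(idx))
        if len(picked) == n_points:
            break
    return picked
```

On a sampled square, for example, the four corners are returned first because they carry the largest turning angle.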

4 Results 

The results of the first experiment are reported in Fig. 5. Fig. 5(a) shows the
recognition rate versus the number of region feature points. For each method,
at a specific number of feature points, we provide two measures: the recognition
rate for poses 9 to 11 and the recognition rate for poses 6 to 15. We draw
attention to the most important points which emerge from these results.

Fig. 2. Boxes database

As expected the alignment method performs better than geometric hashing from 
the recognition rate point of view. In contrast to geometric hashing the method 
does not miss any proper transformation during the coarse verification stage. 
However the graph matching method performs best. 

By increasing the number of feature points the performance of the geometric
hashing and alignment methods improves faster than that of the graph-based
method. In fact the performance of the geometric hashing and alignment methods
totally depends on the accuracy of the feature point coordinates. Using more
feature points increases the probability of obtaining more accurate model-test
triplet pairs and consequently the likelihood of recognising the object. For two
reasons the graph matching method is not affected as much as the other methods
by this problem. First of all, we use feature points to normalise each region
individually, hence the errors in feature point coordinates are not propagated
throughout the test image. Second, we use contextual information to moderate
local imperfections caused by inaccurate region normalisation.

Fig. 4. The traffic signs as the library of model images

Fig. 5. The performances of the methods versus the number of feature points:
(a) recognition rates versus the number of feature points for each region;
(b) processing times versus the number of feature points for each region

Table 1. The result of the experiments on the traffic sign objects for different
recognition methods

Performance      GeomHashing   Alignment   Graph-Based
Correct Recog    50%           62.5%       88.89%

As Fig. 5(b) shows, the recognition time for the alignment and geometric hashing
methods is an exponential function of the number of feature points. The alignment
recognition time is even worse than that of the geometric hashing method, owing
to the lack of an effective candidate pruning strategy. In the graph-based method
the number of feature points affects only the first part of the algorithm, in which
the best representation for each region is selected. In other words, an increase in
the number of feature points does not increase the graph complexity, and
consequently the matching time remains constant.

We applied the recognition methods to the traffic sign image database. The
correct recognition rate is given in Table 1. There are two main reasons for the
poor performance of geometric hashing: an excessively high count of false votes
generated by the clutter, and the different complexity of the objects of interest. The
clutter in an image produces a large number of features, which potentially
increase the number of false votes for each model-triplet. Equally, since objects
may be of very different complexity, there are considerable differences in the
number of feature points among the object models. Thus during hashing a false
vote is more likely to go to a complex object than to a simple one.

To illustrate this problem we plot the feature points of a test image in a hash table
for two different candidates of model-test triplets. In the first case the triplet
by which the test feature points are normalised corresponds to the matching
triplet of the relevant model (Fig. 6(a)). Each square represents a bin in the hash
table. It is centred on a normalised feature of the model, and each dot represents
a normalised test feature point. The bin size is determined by the error tolerance
for the corresponding feature points. In the second case we consider the test image
feature points against the feature points of an irrelevant model. The test and
model images are normalised using an arbitrary pair of model-test triplets. As
can be seen in Fig. 6(a), the number of feature points (squares) in the relevant
model is considerably lower than for the irrelevant model shown in Fig. 6(b). As
a result, the number of votes given to the incorrect triplet-model (case 2) is more
than twice the number of votes given to the correct triplet-model (case 1).
In the alignment method this problem with the selection of candidates does not
arise, but we spend much time verifying a large number of candidates. Although
we use colour information to reduce the number of possible model-test triplets,
the colour constraints cannot be applied effectively because of the considerable
illumination changes in the images of the database. The results show that the
recognition rate for the alignment system is better than for geometric hashing,
but it is not comparable to that of the graph-based method.



5 Discussion and Conclusion 

An extensive experimental comparative study of three distinct matching meth- 
ods for the recognition of 3D objects from a 2D view was carried out. The 
methods investigated include graph matching, geometric hashing and the align- 
ment technique. 

The experiments were conducted on different databases in order to test the
methods under different conditions. The results of the experiments demonstrated the
superiority of the graph matching method over the other two methods in terms
of both recognition accuracy and speed of processing. Most probably the
success of the graph matching approach is due to the fact that the model and test
images are compared by considering local matches between the normalised image
regions. This is in contrast to the alignment and geometric hashing methods,
which measure the distance between the features of the two images in a global
coordinate system. A crucial advantage of the graph-based matching method is
that the global match between the object and the model is defined in terms of
local consistencies. This prevents the propagation of errors throughout the image.
Moreover, the aggregative nature of the graph matching process makes it resilient
to imperfect local matches caused by occlusion and measurement errors.

Fig. 6. Location of normalised test image features with respect to hash table bins in
the two different cases

From the computational point of view, the complexity of the graph matching
method allows the matching task to be completed in a reasonable time.
Moreover, it is independent of the number of features used for representation.
This is not the case for the alignment and geometric hashing methods, whose
processing times depend exponentially on the number of region feature points.
Alignment and geometric hashing are faster than graph matching only when very
few feature points are involved in matching. However, in this case the alignment
and geometric hashing recognition rates are unacceptably poor. Reasonable
performance rates are delivered by the alignment method only when five or more
feature points are used for region representation. However, at this point the
alignment method ceases to be computationally feasible. Geometric hashing, as
tested in the complex scenes of our database, failed to perform satisfactorily on
all counts. Thus graph-based matching proved to be the most promising and will
be developed further by incorporating appearance information in the
representation.


Abstract. Hierarchical image structures are abundant in computer vision,
and have been used to encode part structure, scale spaces, and
a variety of multiresolution features. In this paper, we describe a unified
framework for both indexing and matching such structures. First,
we describe an indexing mechanism that maps the topological structure
of a directed acyclic graph (DAG) into a low-dimensional vector space.
Based on a novel eigenvalue characterization of a DAG, this topological
signature allows us to efficiently retrieve a small set of candidates
from a database of models. To accommodate occlusion and local deformation,
local evidence is accumulated in each of the DAG’s topological
subspaces. Given a small set of candidate models, we will next describe
a matching algorithm that exploits this same topological signature to
compute, in the presence of noise and occlusion, the largest isomorphic
subgraph between the image structure and the candidate model structure
which, in turn, yields a measure of similarity which can be used to rank
the candidates. We demonstrate the approach with a series of indexing
and matching experiments in the domains of 2-D and (view-based) 3-D
generic object recognition.



1 Introduction 

The indexing and matching of hierarchical (e.g., multiscale or multilevel) image
features is a common problem in object recognition. Such structures are often
represented as rooted trees or directed acyclic graphs (DAGs), where nodes
represent image feature abstractions and arcs represent spatial relations, mappings
across resolution levels, component parts, etc. The requirements of matching
include computing a correspondence between nodes in an image structure
and nodes in a model structure, as well as computing an overall measure of
distance (or, alternatively, similarity) between the two structures. Such matching
problems can be formulated as largest isomorphic subgraph or largest isomorphic
subtree problems, for which a wealth of literature exists in the graph algorithms
community. However, the nature of the vision instantiation of this problem often
precludes the direct application of these methods. Due to occlusion and noise,
no significant isomorphisms may exist between two DAGs or rooted trees. Yet,
at some level of abstraction, the two structures (or two of their substructures)
may be quite similar.

The matching procedure is expensive and must be used sparingly. For large 
databases of object models, it is simply unacceptable to perform a linear search 
of the database. Therefore, an indexing mechanism is essential for selecting a 
small set of candidate models to which the matching procedure is applied. When 
working with hierarchical image structures, in the form of graphs, indexing is a 
challenging task, and can be formulated as the fast selection of a set of candi- 
date models that share a subgraph with the query. But how do we test a given 
candidate without resorting to subgraph isomorphism? If there were a small 
number of subgraphs shared among many models, representing a vocabulary of 
object parts, one could conceive of a two-stage indexing process, in which image
structures were matched to the part vocabulary, with parts “voting” for candidate
models. However, we’re still faced with the complexity of subgraph
isomorphism, albeit for a smaller database (vocabulary of parts).

In this paper, we present a unified solution to the problems of indexing and 
matching hierarchical structures. Drawing on techniques from the domain of 
eigenspaces of graphs, we present a technique that maps any rooted hierarchical 
structure, i.e., DAG or rooted tree, to a vector in a low-dimensional space. The 
mapping not only reduces the dimensionality of the representation, but does 
so while retaining important information about the branching structure, node 
distribution, and overall structure of the graph - information that is critical 
in distinguishing DAGs or rooted trees. Moreover, the technique accommodates 
both noise and occlusion, meeting the needs of an indexing structure for vision
applications. Armed with a low-dimensional, robust vector representation
of an input structure, indexing can be reduced to a nearest-neighbor search in a
database of points, each representing the structure of a model (or submodel).

Once a candidate is retrieved by the indexing mechanism, we exploit this 
same eigen-characterization of hierarchical structure to compute a node-to-node 
correspondence between the input and model candidate hierarchical structures. 
We therefore unify our approaches to indexing and matching through a novel
representation of hierarchical structure, leading to an efficient and effective
framework for the recognition of hierarchical structures from large databases. In this
paper, we review our representation, described in earlier work, and include a new
analysis of its stability. We then describe the unifying role of our representation
in the indexing and matching of hierarchical structures. Finally, we demonstrate
the approach on two separate object recognition domains.

2 Related Work 

Eigenspace approaches to shape description and indexing are numerous. Due to
space constraints, we cite only a few examples. Turk and Pentland’s eigenface
approach represented an image as a linear combination of a small number
of basis vectors (images) computed from a large database of images. Nayar and
Murase extended this work to general 3-D objects where a dense set of views
was acquired for each object. Other eigenspace methods have been applied
to higher-level features, offering more potential for generic shape description
and matching. For example, Sclaroff and Pentland compute the eigenmodes of
vibration of a 2-D region, while Shapiro and Brady looked at how the modes
of vibration of a set of 2-D points could be used to solve the point correspondence
problem under translation, rotation, scale, and small skew.

A Unified Framework for Indexing and Matching

In an attempt to index into a database of graphs, Sossa and Horaud use a
small subset of the coefficients of the $d_2$-polynomial corresponding to the
Laplacian matrix associated with a graph, while a spectral graph decomposition
was reported by Sengupta and Boyer for the partitioning of a database of 3-D
models, where nodes in a graph represent 3-D surface patches. Sarkar, and Shi
and Malik, have formulated the perceptual grouping and region segmentation
problems, respectively, as graph partitioning problems and have used
a generalized eigensystem approach to provide an efficient approximation. We
note that in contrast to many of the above approaches to indexing, the
representation that we present in this paper is independent of the contents of the
model database and uses a uniform basis to represent all objects.

There have been many approaches to object recognition based on graph
matching. An incomplete list of examples includes Sanfeliu and Fu, Shapiro
and Haralick, Wong et al., Boyer and Kak (for stereo matching), Kim and Kak,
Messmer and Bunke, Christmas et al., Eshera and Fu, Pelillo et al., Gold and
Rangarajan, Zhu and Yuille, and Cross and Hancock. Although many of these
approaches handle both noise and occlusion, none unify both indexing and
matching through a single spectral mechanism.

3 Indexing Hierarchical Structures 

We make the assumption that if a DAG¹ has rich structure in terms of depth
and/or branching factor, its topology alone may serve as a discriminating index
into a database of model structures. Although false positives (e.g., model DAGs
that have the same structure, but whose node labels are different) may arise,
they may be few in number and can be pruned during verification. As stated
in Section 1, we seek a reduced representation for a DAG that will support
efficient indexing and matching. An effective topological encoding of a DAG’s
structure should: 1) map a DAG’s topology to a point in some low-dimensional
space; 2) capture local topology to support matching/indexing in the presence
of occlusion; 3) be invariant to re-orderings of the DAG’s branches, i.e.,
re-orderings which do not affect the parent-child relationships in the DAG; 4) be
as unique as possible, i.e., different DAGs should have different encodings; 5)
be stable, i.e., small perturbations of a DAG’s topology should result in small
perturbations of the index; and 6) be efficiently computable.

¹ Although a hierarchical structure can take the form of a DAG or rooted tree, we
will henceforth limit our discussion to DAGs, since a rooted tree is a special case of
a DAG.



3.1 An Eigen-Decomposition of Structure 

To describe the topology of a DAG, we turn to the domain of eigenspaces of
graphs, first noting that any graph can be represented as a symmetric $\{0, 1, -1\}$
adjacency matrix, with 1’s (-1’s) indicating a forward (backward) edge between
adjacent nodes in the graph (and 0’s on the diagonal). The eigenvalues of a
graph’s adjacency matrix encode important structural properties of the graph.
Furthermore, the eigenvalues of a symmetric matrix $A$ are invariant to any
orthonormal transformation of the form $P^T A P$. Since a permutation matrix is
orthonormal, the eigenvalues of a graph are invariant to any consistent re-ordering
of the graph’s branches. However, before we can exploit a graph’s eigenvalues
for indexing purposes, we must establish their stability under minor topological
perturbation, due to noise, occlusion, or deformation.
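The invariance to re-ordering is easy to check numerically. The sketch below assumes, for simplicity, a symmetric 0/1 adjacency matrix of a small tree (so that the eigenvalues are real) and an arbitrary relabelling of its four nodes; both are illustrative choices rather than anything taken from the text.

```python
import numpy as np

def sorted_eigenvalues(a):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(a))[::-1]

# Symmetric adjacency matrix of the path 2 - 0 - 1 - 3 (an assumed toy graph).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# A permutation matrix P relabels the nodes; P A P^T is the re-ordered graph.
perm = [2, 0, 3, 1]
P = np.eye(4)[perm]
B = P @ A @ P.T

# The two spectra coincide because P is orthonormal.
same_spectrum = np.allclose(sorted_eigenvalues(A), sorted_eigenvalues(B))
```

Here `same_spectrum` evaluates to `True`, although `A` and `B` differ entry by entry.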

We will begin by showing that any structural change to a graph can be 
modeled as a two-step transformation of its original adjacency matrix. The first 
step transforms the graph’s original adjacency matrix to a new matrix having 
the same spectral properties as the original matrix. The second step adds a 
noise matrix to this new matrix, representing the structural changes due to 
noise and/or occlusion. These changes take the form of the addition/deletion 
of nodes/arcs to/from the original graph. We will then draw on an important 
result that relates the distortion of the eigenvalues of the matrix resulting from 
the first step to the magnitude of the noise added in the second step. Since 
the eigenvalues of the original matrix are the same as those of the transformed 
matrix (first step) , the noise-dependent eigenvalue bounds therefore apply to the 
original matrix. The result will establish the insensitivity of a graph’s spectral 
properties to minor topological changes. 

Let’s begin with some definitions. Let $A_G \in \{0, 1, -1\}^{m \times m}$ denote the
adjacency matrix of the graph $G$ on $m$ vertices, and assume $H$ is an $n$-vertex graph
obtained by adding $n - m$ new vertices and a set of edges to the graph $G$. Let
$\Psi : \{0, 1, -1\}^{m \times m} \to \{0, 1, -1\}^{n \times n}$ be a lifting operator which transforms a
subspace of $\mathbb{R}^{m \times m}$ to a subspace of $\mathbb{R}^{n \times n}$, with $n \ge m$. We will call this
operator spectrum preserving if the eigenvalues of any matrix $A \in \{0, 1, -1\}^{m \times m}$ and
its image with respect to the operator ($\Psi(A)$) are the same up to a degeneracy,
i.e., the only difference between the spectra of $A$ and $\Psi(A)$ is the number of zero
eigenvalues ($\Psi(A)$ has $n - m$ more zero eigenvalues than $A$).

As stated above, our goal is to show that any structural change in graph $G$ can
be represented in terms of a spectrum preserving operator and a noise matrix.
Specifically, if $A_H$ denotes the adjacency matrix of the graph $H$, then there exists
a spectrum preserving operator $\Psi()$ and a noise matrix $E_H \in \{0, 1, -1\}^{n \times n}$
such that:

$$A_H = \Psi(A_G) + E_H \tag{1}$$

We will define $\Psi()$ as a lifting operator consisting of two steps. First, we will
add $n - m$ zero rows and columns to the matrix $A_G$, and denote the resulting
matrix by $A'_G$. Next, $A'_G$ will be pre- and post-multiplied by a permutation
matrix $P$ and its transpose $P^T$, respectively, aligning the rows and columns
corresponding to the same vertices in $A_H$ and $\Psi(A_G)$. Since the only difference
between the eigenvalues of $A'_G$ and $A_G$ is the number of zero eigenvalues, and
$P A'_G P^T$ has the same set of eigenvalues as the matrix $A'_G$, $\Psi()$ is a spectrum
preserving operator. As a result, the noise matrix $E_H$ can be represented as

$$E_H = A_H - \Psi(A_G) \in \{0, 1, -1\}^{n \times n}.$$

Armed with a spectrum-preserving lifting operator and a noise matrix, we
can now proceed to quantify the impact of the noise on the original graph’s
eigenvalues. Specifically, let $\lambda_k(A)$ for $k \in \{1, \ldots, n\}$ denote the $k$-th largest
eigenvalue of the matrix $A$. A seminal result of Wilkinson (see also Stewart and Sun)
states that:

Theorem 1. If $A$ and $A + E$ are $n \times n$ symmetric matrices, then:

$$\lambda_k(A) + \lambda_n(E) \le \lambda_k(A + E) \le \lambda_k(A) + \lambda_1(E), \quad \text{for all } k \in \{1, \ldots, n\} \tag{2}$$

For $H$ and $G$, we know that $\lambda_1(E_H) = \|E_H\|_2$. Therefore, using the above
theorem, for all $k \in \{1, \ldots, n\}$:

$$|\lambda_k(A_H) - \lambda_k(\Psi(A_G))| = |\lambda_k(\Psi(A_G) + E_H) - \lambda_k(\Psi(A_G))| \le \max\{|\lambda_1(E_H)|, |\lambda_n(E_H)|\} = \|E_H\|_2. \tag{3}$$

The above chain of inequalities gives a precise bound on the distortion of the
eigenvalues of $\Psi(A_G)$ in terms of the largest eigenvalue of the noise matrix $E_H$.
Since $\Psi()$ is a spectrum preserving operator, the eigenvalues of $A_G$ follow the
same bound in their distortions.
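A small numerical check may make the two-step argument concrete. The sketch below assumes symmetric 0/1 adjacency matrices and a toy pair of graphs (a path on three vertices, extended by one vertex and one edge); the helper names `lift` and `eigs_desc` are hypothetical. It builds the lifted matrix by zero-padding and permuting, forms the noise matrix as the difference, and compares the largest eigenvalue shift with the spectral norm of the noise.

```python
import numpy as np

def lift(a_g, n, perm):
    """Zero-pad an m x m matrix to n x n, then align vertices with a permutation."""
    m = a_g.shape[0]
    padded = np.zeros((n, n))
    padded[:m, :m] = a_g
    p = np.eye(n)[list(perm)]
    return p @ padded @ p.T

def eigs_desc(a):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(a))[::-1]

# G: path on 3 vertices. H: the same path with one extra vertex attached at the end.
A_G = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
A_H = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

lifted = lift(A_G, 4, range(4))   # vertex order already agrees, so P is the identity
E_H = A_H - lifted                # noise matrix: the single added edge

distortion = np.max(np.abs(eigs_desc(A_H) - eigs_desc(lifted)))
bound = np.linalg.norm(E_H, 2)    # spectral norm of the noise matrix
```

In this example the lifted matrix keeps the non-zero eigenvalues of `A_G` and gains one zero eigenvalue, and `distortion` stays below `bound`, as the inequality chain requires.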

The above result has several important consequences for our application of a
graph’s eigenvalues to graph indexing. Namely, if the perturbation $E_H$ is small
in terms of its complexity, then the eigenvalues of the new graph $H$ (e.g., the
query graph) will remain close to their corresponding non-zero eigenvalues of the
original graph $G$ (e.g., the model graph), independent of where the perturbation
is applied to $G$. The magnitude of the eigenvalue distortion is a function of the
number of vertices added to the graph due to the noise or occlusion. Specifically,
if the noise matrix $E_H$ introduces $k$ new vertices to $G$, then the distortion of
every eigenvalue can be bounded by $\sqrt{k-1}$ (Neumaier [18]). This bound can
be further tightened if the noise matrix has simple structure. For example, if
$E_H$ represents a simple path on $k$ vertices, then its norm can be bounded by
$2\cos(\pi/(k+1))$ (Lovász and Pelikán). In short, large distortions are due
to the introduction/deletion of large, complex subgraphs to/from $G$, while small
structural changes will have little impact on the higher order eigenvalues of $G$. The
eigenvalues of a graph are therefore stable under minor perturbations in graph
structure.

3.2 Formulating an Index



Having established the stability of a DAG’s eigenvalues under minor perturbation 
of the graph, we can now proceed to define an index based on the eigenvalues. We 
could, for example, define a vector to be the sorted eigenvalues of a DAG, with 
the resulting index used to retrieve nearest neighbors in a model DAG database 
having similar topology. However, for large DAGs, the dimensionality of the 
index (and model DAG database) would be prohibitively large. Our solution to 
this problem will be based on eigenvalue sums rather than on the eigenvalues 
themselves. 

Specifically, let $T$ be a DAG whose maximum branching factor is $\Delta(T)$, and
let the subgraphs of its root be $T_1, T_2, \ldots, T_S$. For each subgraph, $T_i$, whose
root degree is $\delta(T_i)$, we compute the eigenvalues of $T_i$’s submatrix, sort the
eigenvalues in decreasing order by absolute value, and let $S_i$ be the sum of the
$\delta(T_i) - 1$ largest absolute values. The sorted $S_i$’s become the components of a
$\Delta(T)$-dimensional vector assigned to the DAG’s root. If the number of $S_i$’s is less
than $\Delta(T)$, then the vector is padded with zeroes. We can recursively repeat this
procedure, assigning a vector to each nonterminal node in the DAG, computed
over the subgraph rooted at that node. The reasons for computing a description
for each node, rather than just the root, will become clear in the next section.
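The construction can be sketched in a few lines. Several choices below are assumptions made for illustration only: the DAG is given as a dictionary of child lists, the adjacency matrix of each subgraph is taken as the symmetric 0/1 matrix of its underlying undirected graph (so the eigenvalues are real), the root degree of a subgraph is read as the number of children of its root, and a root with fewer than two children contributes a sum of zero.

```python
import numpy as np

def descendants(children, root):
    """Nodes reachable from root (root included)."""
    seen, stack = [], [root]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.append(v)
            stack.extend(children.get(v, []))
    return seen

def eigenvalue_sum(children, root):
    """Sum of the (degree - 1) largest eigenvalue magnitudes of the subgraph at root."""
    nodes = descendants(children, root)
    pos = {v: i for i, v in enumerate(nodes)}
    a = np.zeros((len(nodes), len(nodes)))
    for u in nodes:
        for v in children.get(u, []):
            a[pos[u], pos[v]] = a[pos[v], pos[u]] = 1.0
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(a)))[::-1]
    k = max(len(children.get(root, [])) - 1, 0)   # assumed reading of the root degree
    return float(magnitudes[:k].sum())

def signature(children, node, max_branching):
    """Sorted eigenvalue sums of the node's subgraphs, zero-padded to max_branching."""
    sums = sorted((eigenvalue_sum(children, c) for c in children.get(node, [])),
                  reverse=True)
    return np.array(sums + [0.0] * (max_branching - len(sums)))
```

For a root with one child carrying three leaves and another carrying two leaves, `signature` yields the vector (2√3, √2, 0) when `max_branching` is 3, and the result is unchanged if the child lists are reordered.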

Although the eigenvalue sums are invariant to any consistent re-ordering of
the DAG’s branches, we have given up some uniqueness (due to the summing
operation) in order to reduce dimensionality. We could have elevated only the
largest eigenvalue from each subgraph (non-unique but less ambiguous), but
this would be less representative of the subgraph’s structure. We choose the
$\delta(T_i) - 1$ largest eigenvalues for two reasons: 1) the largest eigenvalues are more
informative of subgraph structure, and 2) by summing $\delta(T_i) - 1$ elements, we
effectively normalize the sum according to the local complexity of the subgraph
root.

To efficiently compute the submatrix eigenvalue sums, we turn to the domain
of semidefinite programming. A symmetric $n \times n$ matrix $A$ with real entries is
said to be positive semidefinite, denoted as $A \succeq 0$, if for all vectors $x \in \mathbb{R}^n$,
$x^T A x \ge 0$, or equivalently, all its eigenvalues are non-negative. We say that
$U \succeq V$ if the matrix $U - V$ is positive semidefinite. For any two matrices $U$
and $V$ having the same dimensions, we define $U \bullet V$ as their inner product, i.e.,
$U \bullet V = \sum_i \sum_j U_{i,j} V_{i,j}$. For any square matrix $U$, we define
$\mathrm{trace}(U) = \sum_i U_{i,i}$.

Let $I$ denote the identity matrix having suitable dimensions. The following
result, due to Overton and Womersley [19], characterizes the sum of the first $k$
largest eigenvalues of a symmetric matrix in the form of a semidefinite convex
programming problem:



Theorem 2 (Overton and Womersley [19]). For the sum of the first $k$
eigenvalues of a symmetric matrix $A$, the following semidefinite programming
characterization holds:

$$\begin{aligned} \lambda_1(A) + \ldots + \lambda_k(A) = \max\ & A \bullet U \\ \text{s.t.}\ & \mathrm{trace}(U) = k \\ & 0 \preceq U \preceq I. \end{aligned} \tag{4}$$

The elegance of Theorem 2 lies in the fact that the equivalent semidefinite
programming problem can be solved, for any desired accuracy $\epsilon$, in time polynomial
in $O(n\sqrt{n}L)$ and $\log\frac{1}{\epsilon}$, where $L$ is an upper bound on the size of the
optimal solution, using a variant of the Interior Point method proposed by Alizadeh.
In effect, the complexity of directly computing the eigenvalue sums is a
significant improvement over the $O(n^3)$ time required to compute the individual
eigenvalues, sort them, and sum them.
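
As a small sanity check of the characterization in Theorem 2, the following worked example (added here for illustration; the matrix is our own choice and does not come from the paper) takes the adjacency matrix of the path on three nodes and exhibits a feasible $U$ that attains the eigenvalue sum for $k = 1$ and $k = 2$.

```latex
% Worked example (illustrative assumption: A is the adjacency matrix of the path P_3).
\[
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
\lambda_1 = \sqrt{2}, \quad \lambda_2 = 0, \quad \lambda_3 = -\sqrt{2}.
\]
% Unit eigenvectors for \lambda_1 and \lambda_2:
\[
v = \left(\tfrac12, \tfrac{1}{\sqrt2}, \tfrac12\right)^T, \qquad
w = \left(\tfrac{1}{\sqrt2}, 0, -\tfrac{1}{\sqrt2}\right)^T .
\]
% k = 1: U = v v^T satisfies trace(U) = 1 and 0 \preceq U \preceq I, and
\[
A \bullet (v v^T) = v^T A v = \sqrt{2} = \lambda_1 .
\]
% k = 2: U = v v^T + w w^T satisfies trace(U) = 2 and 0 \preceq U \preceq I, and
\[
A \bullet (v v^T + w w^T) = v^T A v + w^T A w = \sqrt{2} + 0 = \lambda_1 + \lambda_2 .
\]
```

In both cases the maximizer is the orthogonal projector onto the span of the top $k$ eigenvectors, which is why the optimum of (4) equals the sum of the $k$ largest eigenvalues.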

3.3 Properties of the Index 

Our topological index satisfies the six criteria outlined earlier. The eigen-decomposition
yields a low-dimensional (criterion 1) vector assigned to each node
in the DAG, which captures the local topology of the subgraph rooted at that
node (criterion 2; this will be more fully explained in Section 3.4). Furthermore,
a node’s vector is invariant to any consistent re-ordering of the node’s
subgraphs (criterion 3). The components of a node’s vector are based on summing
the largest eigenvalues of its subgraph’s adjacency submatrix. Although
our dimensionality-reducing summing operation has cost us some uniqueness,
our partial sums still have very low ambiguity (criterion 4). The sensitivity
analysis in Section 3.1 shows our index to be stable to minor perturbations
of the DAG’s topology (criterion 5). As shown in Theorem 2, these
sums can be computed even more efficiently (criterion 6) than the eigenvalues
themselves. All DAGs isomorphic to $T$ not only have the
same vector labeling, but span the same subspace in $\mathbb{R}^{\delta-1}$. Moreover, this
extends to any DAG which has a subgraph isomorphic to a subgraph of $T$.

3.4 Candidate Selection 

Given a query DAG corresponding to an image, our task is to search the model 
DAG database for one or more model DAGs which are similar to the image 
DAG. If the number of model DAGs is large, a linear search of the database is 
intractable. Therefore, the goal of our indexing mechanism is to quickly select a 
small number of model candidates for verification. Those candidates will share 
coarse topological structure with the image DAG (or one of its subgraphs, if 
it is occluded or poorly segmented). Hence, we begin by mapping the topology 
of the image DAG to a set of indices that capture its structure, discounting 
any information associated with its nodes. We then describe the structure of 
our model database, along with our mechanism for indexing into it to yield a 
small set of model candidates. Finally, we present a local evidence accumulation 
procedure that will allow us to index in the presence of occlusion. 

A Database for Model DAGs. Our eigenvalue characterization of a DAG’s
topology suggests that a model DAG’s topological structure can be represented
as a vector in $\delta$-dimensional space, where $\delta$ is an upper bound on the degree
of any vertex of any image or model DAG. If we could assume that an image
DAG represents a properly segmented, unoccluded object, then the vector of 
eigenvalue sums, which we will call the topological signature vector (or TSV), 
computed at the image DAG’s root, could be compared with those topological 
signature vectors representing the roots of the model DAGs. The vector distance 
between the image DAG’s root TSV and a model DAG’s root TSV would be 
inversely proportional to the topological similarity of their respective DAGs, as 
finding two subgraphs with “close” eigenvalue sums represents an approximation 
to finding the largest isomorphic subgraph. 

Unfortunately, this simple framework cannot support either cluttered scenes 
or large occlusion, both of which result in the addition or removal of signifi- 
cant structure. In either case, altering the structure of the DAG will affect the 
TSVs computed at its nodes. The signatures corresponding to the roots of those
subgraphs (DAGs) that survive the occlusion will not change. However, the sig- 
nature of a root of a subgraph that has undergone any perturbation will change 
which, in turn, will affect the signatures of any of its ancestor nodes, including 
the root of the entire DAG. We therefore cannot rely on indexing solely with the 
root’s signature. Instead, we will exploit the local subgraphs that survive the 
occlusion. 

We can accommodate such perturbations through a local indexing framework
analogous to that used in a number of geometric hashing methods.
Rather than storing a model DAG’s root signature, we will store the
signatures of each node in the model DAG, along with a pointer to the object
model containing that node as well as a pointer to the corresponding node in
the model DAG (allowing access to node label information). Since a given model
subgraph can be shared by other model DAGs, a given signature (or location
in $\delta$-dimensional space) will point to a list of (model object, model node) ordered
pairs. At runtime, the signature at each node in the image DAG becomes
a separate index, with each nearby candidate in the database “voting” for one
or more (model object, model node) pairs. Nearby candidates can be retrieved
using a nearest neighbor retrieval method. In our implementation, model points
were stored in a Voronoi database, whose off-line construction (decomposition) is
$O((kn)^{\lfloor(\delta+1)/2\rfloor+1}) + O((kn)^{\lfloor(\delta+1)/2\rfloor}\log(kn))$, and whose run-time search
is $O(\log^{\delta}(kn))$ for fixed $\delta$ [23]; details are given in [33].
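
The local indexing scheme above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it replaces the Voronoi database with a linear scan, and the class name `SignatureIndex`, its methods, and the sample signatures are all hypothetical.

```python
import math
from collections import defaultdict


class SignatureIndex:
    """Toy signature database: each stored TSV points to a list of
    (model object, model node) pairs. A linear scan stands in for the
    nearest-neighbor structure described in the text (assumption)."""

    def __init__(self):
        self._entries = defaultdict(list)  # tuple(signature) -> [(model, node), ...]

    def add(self, signature, model, node):
        # Shared model subgraphs produce the same signature, so we append.
        self._entries[tuple(signature)].append((model, node))

    def neighbors(self, query, radius):
        """Return (distance, signature, pairs) for every stored signature
        within `radius` of `query`, nearest first."""
        q = tuple(query)
        hits = []
        for sig, pairs in self._entries.items():
            d = math.dist(sig, q)
            if d <= radius:
                hits.append((d, sig, pairs))
        hits.sort(key=lambda h: h[0])
        return hits


index = SignatureIndex()
index.add([2.0, 1.0, 0.0], "hammer", 0)
index.add([2.0, 1.0, 0.0], "mallet", 3)   # same subgraph shared by two models
index.add([5.0, 4.0, 1.0], "hand", 1)
hits = index.neighbors([2.1, 1.0, 0.0], radius=0.5)
```

Each image node's signature would be passed to `neighbors`, and every returned pair becomes one vote. A spatial data structure would be needed for large databases; the scan here costs time linear in the number of stored signatures.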



Accumulating Local Evidence. Each node in the image DAG will generate 
a set of (model object, model node) votes. To collect these votes, we set up an 
accumulator with one bin per model object. Furthermore, we can weight the 
votes that we add to the accumulator. For example, if the label of the model 
node is not compatible with the label of its corresponding image node, then the 
vote is discarded, i.e., it receives a zero weight. If the nodes are label-compatible,
then we can weight the vote according to the distance between their respective
TSVs: the closer the signatures, the more weight the vote gets.

We can also weight the vote according to the complexity of its corresponding
subgraph, allowing larger and more complex subgraphs (or “parts”) to have
higher weight. This can be easily accommodated within our eigenvalue framework,
for the richer the structure, the larger its maximum eigenvalue:

Theorem 3 (Lovász and Pelikán). Among the trees with $n$ vertices,
the star graph has the largest maximum eigenvalue ($\sqrt{n-1}$), while the path on
$n$ nodes ($P_n$) has the smallest maximum eigenvalue ($2\cos(\pi/(n+1))$).
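
The two extreme values in Theorem 3 are easy to check numerically. The sketch below is our own illustration (the helper names `star`, `path`, and `largest_eigenvalue` are hypothetical); it uses plain power iteration on the shifted matrix $A + I$, a choice made here because the spectrum of a bipartite graph is symmetric about zero and unshifted power iteration would not converge.

```python
import math


def star(n):
    """Adjacency matrix of the star on n vertices (vertex 0 is the hub)."""
    a = [[0.0] * n for _ in range(n)]
    for leaf in range(1, n):
        a[0][leaf] = a[leaf][0] = 1.0
    return a


def path(n):
    """Adjacency matrix of the path on n vertices."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        a[i][i + 1] = a[i + 1][i] = 1.0
    return a


def largest_eigenvalue(a, iterations=500):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix.

    Power iteration is run on A + I so that the dominant eigenvalue is
    unique in magnitude; the result is the Rayleigh quotient of A itself.
    """
    n = len(a)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * ax[i] for i in range(n))


n = 5
star_value = largest_eigenvalue(star(n))   # expected: sqrt(n - 1)
path_value = largest_eigenvalue(path(n))   # expected: 2 cos(pi / (n + 1))
```

For $n = 5$ the star gives $\sqrt{4} = 2$ and the path gives $2\cos(\pi/6) = \sqrt{3}$, so the more heavily branched structure has the larger maximum eigenvalue.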



Since the size of the eigenvalues, and hence their sum, is proportional to both the
branching factor as well as the number of nodes, the magnitude of the signature
is also used to weight the vote. If we let $u$ be the TSV of an image DAG node
and $v$ the TSV of a model DAG node that is sufficiently close, the weight of
the resulting vote, i.e., the local evidence for the model, is computed as (we use
$p = 2$):

$$W = \frac{\|u\|_p}{1 + \|v - u\|_p} \qquad (5)$$



Once the evidence accumulation is complete, those models whose support 
is sufficiently high are selected as candidates for verification. The bins can, in 
effect, be organized in a heap, requiring a maximum of $O(\log k)$ operations to
maintain the heap when evidence is added, where k is the number of non-zero 
object accumulators. Once the top-scoring models have been selected, they must 
be individually verified according to some matching algorithm. 
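
The evidence accumulation step can be sketched as below. This is an illustration under stated assumptions: the vote weight is taken to be $\|u\|_p / (1 + \|v - u\|_p)$, which is our reading of Eq. (5) (signature magnitude divided by one plus signature distance), and the function names and sample votes are hypothetical.

```python
import heapq
from collections import defaultdict


def vote_weight(u, v, p=2):
    """Assumed form of the vote weight: ||u||_p / (1 + ||v - u||_p),
    where u is the image-node TSV and v the model-node TSV."""
    def norm(w):
        return sum(abs(c) ** p for c in w) ** (1.0 / p)
    return norm(u) / (1.0 + norm([a - b for a, b in zip(v, u)]))


def accumulate(votes):
    """votes: iterable of (model, image_tsv, model_tsv, labels_compatible).
    Returns one accumulator bin per model object."""
    bins = defaultdict(float)
    for model, u, v, compatible in votes:
        if compatible:  # label-incompatible votes receive zero weight
            bins[model] += vote_weight(u, v)
    return bins


def top_candidates(bins, count):
    """The `count` best-supported models, highest score first."""
    return heapq.nlargest(count, bins.items(), key=lambda kv: kv[1])


votes = [
    ("hammer", [3.0, 4.0], [3.0, 4.0], True),   # identical signatures: weight 5
    ("hammer", [1.0, 0.0], [1.0, 1.0], True),   # distance 1: weight 0.5
    ("mug",    [3.0, 4.0], [3.0, 4.0], False),  # incompatible labels: discarded
    ("mug",    [1.0, 0.0], [1.0, 0.0], True),   # weight 1
]
bins = accumulate(votes)
best = top_candidates(bins, 1)
```

Here `heapq.nlargest` selects the best bins after accumulation; keeping the bins in a heap throughout, as the text suggests, would serve the same purpose incrementally.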



4 Matching Hierarchical Structures 

Each of the top-ranking candidates emerging from the indexing process must be 
verified to determine which is most similar to the query. If there were no clutter, 
occlusion, or noise, our problem could be formulated as a graph isomorphism 
problem. If we allowed clutter and limited occlusion, we would search for the 
largest isomorphic subgraphs between query and model. Unfortunately, with the 
presence of noise, in the form of the addition of spurious graph structure and/or 
the deletion of salient graph structure, large isomorphic subgraphs may simply 
not exist. It is here that we call on our eigen-characterization of graph structure 
to help us overcome this problem. 

Each node in our graph (query or model) is assigned a TSV, which reflects the 
underlying structure in the subgraph rooted at that node. If we simply discarded 
all the edges in our two graphs, we would be faced with the problem of finding the 
best correspondence between the nodes in the query and the nodes in the model; 
two nodes could be said to be in close correspondence if the distance between 
their TSVs (and the distance between their domain-dependent node labels) was 
small. In fact, such a formulation amounts to finding the maximum cardinality, 
minimum weight matching in a bipartite graph spanning the two sets of nodes. 
At first glance, such a formulation might seem like a bad idea (by throwing away 
all that important graph structure!) until one recalls that the graph structure
is really encoded in the node’s TSV. Is it then possible to reformulate a noisy, 
largest isomorphic subgraph problem as a simple bipartite matching problem? 
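
To make the bipartite formulation concrete, the sketch below finds a maximum cardinality, minimum weight matching by exhaustive enumeration. This is only an illustration for very small inputs: it takes exponential time, it is not the scaling algorithm used in the paper, and the function name and sample weights are hypothetical.

```python
from itertools import permutations


def min_weight_matching(weights):
    """Exhaustive maximum-cardinality, minimum-weight bipartite matching.

    weights[i][j] is the cost of pairing query node i with model node j.
    Assumes len(weights) <= len(weights[0]), so every query node is matched.
    Returns (total cost, list of (query index, model index) pairs).
    """
    rows, cols = len(weights), len(weights[0])
    best_cost, best_pairs = None, []
    for assignment in permutations(range(cols), rows):
        cost = sum(weights[i][j] for i, j in enumerate(assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(assignment))
    return best_cost, best_pairs


# Two query nodes, three model nodes; entries stand in for TSV distances.
weights = [
    [0.1, 0.9, 0.5],
    [0.8, 0.2, 0.7],
]
cost, pairs = min_weight_matching(weights)
```

In this example query node 0 pairs with model node 0 and query node 1 with model node 1. Nothing in the search knows about parent/child relations, which is exactly the weakness discussed next.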

Unfortunately, in discarding all the graph structure, we have also discarded 
the underlying hierarchical structure. There is nothing in the bipartite graph 
matching formulation that ensures that hierarchical constraints among corre- 
sponding nodes are obeyed, i.e., that parent/child nodes in one graph don’t 
match child/parent nodes in the other. This reformulation, although softening 
the overly strict constraints imposed by the largest isomorphic subgraph formu- 
lation, is perhaps too weak. We could try to enforce the hierarchical constraints 
in our bipartite matching formulation, but no polynomial-time solution is known 
to exist for the resulting formulation. Clearly, we seek an efficient approximation 
method that will find corresponding nodes between two noisy, occluded DAGs, 
subject to hierarchical constraints. 

Our algorithm, a modification to Reyner’s algorithm, combines the above
bipartite matching formulation with a greedy, best-first search in a recursive pro- 
cedure to compute the corresponding nodes in two rooted DAGs. As in the above 
bipartite matching formulation, we compute the maximum cardinality, minimum 
weight matching in the bipartite graph spanning the two sets of nodes. Edge 
weight will encode a function of both topological similarity as well as domain- 
dependent node similarity. The result will be a selection of edges yielding a 
mapping between query and model nodes. As mentioned above, the computed 
mapping may not obey hierarchical constraints. We therefore greedily choose 
only the best edge (the two most similar nodes in the two graphs, representing 
in some sense the two most similar subgraphs), add it to the solution set, and 
recursively apply the procedure to the subgraphs defined by these two nodes. 
Unlike a traditional depth-first search which backtracks to the next statically- 
determined branch, our algorithm effectively recomputes the branches at each 
node, always choosing the next branch to descend in a best-first manner. In 
this way, the search for corresponding nodes is focused in corresponding sub- 
graphs (rooted DAGs) in a top-down manner, thereby ensuring that hierarchical 
constraints are obeyed. 

Before formalizing our algorithm, some definitions are in order. Let
$G = (V_1, E_1)$ and $H = (V_2, E_2)$ be the two DAGs to be matched, with $|V_1| = n_1$ and
$|V_2| = n_2$. Define $d$ to be the maximum degree of any vertex in $G$ and $H$, i.e.,
$d = \max(\delta(G), \delta(H))$. For each vertex $v$, we define $\chi(v) \in \mathbb{R}^{d-1}$ as the unique
topological signature vector (TSV), introduced in Section 3.$^1$ Furthermore, for
any pair of vertices $u$ and $v$, let $C(u, v)$ denote the domain dependent node label
distance between vertices $u$ and $v$. Finally, let $\Phi(G, H)$ (initially empty) be the
set of final node correspondences between $G$ and $H$, representing the solution to
our matching problem.

$^1$ Note that if the maximum degree of a node is $d$, then excluding the edge from the
node’s parent, the maximum number of children is $d - 1$. Also note that if $\delta(v) < d$,
then the last $d - \delta(v)$ entries of $\chi$ are set to zero to ensure that all $\chi$ vectors
have the same dimension.

The algorithm begins by forming an $n_1 \times n_2$ matrix $\Pi(G, H)$ whose $(u, v)$-th
entry has the value $C(u, v)\,\|\chi(u) - \chi(v)\|_2$, assuming that $u$ and $v$ are compatible
in terms of their node labels, and has the value $\infty$ otherwise. Next, we form a
bipartite edge weighted graph $\mathcal{G}(V_1, V_2, E_{\mathcal{G}})$ with edge weights from the matrix
$\Pi(G, H)$.$^2$ Using the scaling algorithm of Goemans, Gabow, and Williamson,
we then find the maximum cardinality, minimum weight matching in $\mathcal{G}$.
This results in a list of node correspondences between $G$ and $H$, called $\mathcal{M}_1$, that
can be ranked in decreasing order of similarity.
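
The construction of the weight matrix can be sketched as follows. The sketch assumes the entry is the product of the node label distance and the Euclidean distance between TSVs, with infinity for label-incompatible pairs; the function name, the callbacks `label_distance` and `compatible`, and the sample data are hypothetical.

```python
import math


def weight_matrix(g_nodes, h_nodes, tsv, label_distance, compatible):
    """Build the |g_nodes| x |h_nodes| matrix whose (u, v) entry is
    label_distance(u, v) * ||tsv[u] - tsv[v]||_2 for compatible nodes
    and infinity otherwise."""
    matrix = []
    for u in g_nodes:
        row = []
        for v in h_nodes:
            if compatible(u, v):
                dist = math.dist(tuple(tsv[u]), tuple(tsv[v]))
                row.append(label_distance(u, v) * dist)
            else:
                row.append(math.inf)
        matrix.append(row)
    return matrix


# Hypothetical two-node query and model, with made-up signatures.
tsv = {"g0": (3.0, 1.0), "g1": (1.0, 0.0), "h0": (3.0, 1.0), "h1": (1.0, 1.0)}
pi = weight_matrix(
    ["g0", "g1"], ["h0", "h1"], tsv,
    label_distance=lambda u, v: 2.0,                       # constant label distance
    compatible=lambda u, v: not (u == "g0" and v == "h1"), # one forbidden pair
)
```

The resulting matrix can be handed to any minimum weight bipartite matching routine; infinite entries simply mark pairs that must not be matched.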

From $\mathcal{M}_1$, we choose $(u_1, v_1)$ as the pair that has the minimum weight among
all the pairs in $\mathcal{M}_1$, i.e., the first pair in $\mathcal{M}_1$. The pair $(u_1, v_1)$ is removed from the
list and added to the solution set $\Phi(G, H)$, and the remainder of the list is
discarded. For the rooted subgraphs $G_{u_1}$ and $H_{v_1}$ of $G$ and $H$, rooted at nodes $u_1$
and $v_1$, respectively, we form the matrix $\Pi(G_{u_1}, H_{v_1})$ using the same procedure
described above. Once the matrix is formed, we find the matching $\mathcal{M}_2$ in the
bipartite graph defined by weight matrix $\Pi(G_{u_1}, H_{v_1})$, yielding another ordered
list of node correspondences. The procedure is recursively applied to $(u_2, v_2)$,
the edge with minimum weight in $\mathcal{M}_2$, with the remainder of the list discarded.

This recursive process eventually reaches the bottom of the DAGs, forming
a list of ordered correspondence lists (or matchings) $\{\mathcal{M}_1, \ldots, \mathcal{M}_k\}$. In backtracking
step $i$, we remove any subgraph from the graphs $G_i$ and $H_i$ whose roots
participate in a matching pair in $\Phi(G, H)$ (we enforce a one-to-one correspondence
of nodes in the solution set). Then, in a depth-first manner, we recompute
$\mathcal{M}_i$ on the subgraphs rooted at $u_i$ and $v_i$ (with solution set nodes removed).
As before, we choose the minimum weight matching pair, and recursively descend.
Unlike a traditional depth-first search, we are dynamically recomputing
the branches at each node in the search tree. Processing at a particular node
will terminate when either rooted subgraph loses all of its nodes to the solution
set. The precise algorithm is given in Figure 1; additional details and examples
are given elsewhere.

In terms of algorithmic complexity, observe that during the depth-first construction
of the matching chains, each vertex in $G$ or $H$ will be matched
at most once in the forward procedure. Once a vertex is mapped, it will
never participate in another mapping again. The total time complexity of
constructing the matching chains is therefore bounded by $O(n^2\sqrt{n \log\log n})$,
for $n = \max(n_1, n_2)$. Moreover, the construction of the $\chi(v)$ vectors will
take $O(n\sqrt{n}L)$ time, implying that the overall complexity of the algorithm is
$\max(O(n^2\sqrt{n \log\log n}), O(n^2\sqrt{n}L))$. The above algorithm therefore provides, in
polynomial time better than $O(n^3)$, an approximate optimal solution to the
largest isomorphic subgraph problem in the presence of noise.



$^2$ $\mathcal{G}(A, B, E)$ is a weighted bipartite graph with weight matrix $W = [w_{ij}]$ of size
$|A| \times |B|$ if, for all edges of the form $(i, j) \in E$, $i \in A$, $j \in B$, and $(i, j)$ has an
associated weight $w_{ij}$.

procedure isomorphism($G$, $H$)
    $\Phi(G, H) \leftarrow \emptyset$
    $d \leftarrow \max(\delta(G), \delta(H))$
    for $u \in V_G$ compute $\chi(u) \in \mathbb{R}^{d-1}$ (see Section 3)
    for $v \in V_H$ compute $\chi(v) \in \mathbb{R}^{d-1}$ (see Section 3)
    call match(root($G$), root($H$))
    return(cost($\Phi(G, H)$))

procedure match($u$, $v$)
    do
    {
        let $G_u \leftarrow$ rooted subgraph of $G$ at $u$
        let $H_v \leftarrow$ rooted subgraph of $H$ at $v$
        compute $|V_{G_u}| \times |V_{H_v}|$ weight matrix $\Pi(G_u, H_v)$
        $\mathcal{M} \leftarrow$ max cardinality, minimum weight bipartite matching
            in $\mathcal{G}(V_{G_u}, V_{H_v})$ with weights from $\Pi(G_u, H_v)$
        $(u', v') \leftarrow$ minimum weight pair in $\mathcal{M}$
        $\Phi(G, H) \leftarrow \Phi(G, H) \cup \{(u', v')\}$
        call match($u'$, $v'$)
        $G_u \leftarrow G_u - \{x \mid x \in V_{G_u} \text{ and } (x, w) \in \Phi(G, H)\}$
        $H_v \leftarrow H_v - \{y \mid y \in V_{H_v} \text{ and } (w, y) \in \Phi(G, H)\}$
    }
    while ($G_u \neq \emptyset$ and $H_v \neq \emptyset$)

Fig. 1. Algorithm for Matching Two Hierarchical Structures.
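
The recursive structure of Fig. 1 can be sketched in a few lines. This is a simplified illustration, not the authors' code: DAGs are dictionaries mapping a node to its children, the bipartite matching is found by brute force rather than by a scaling algorithm, matched nodes are pruned together with the subgraphs below them, and scalar "signatures" stand in for TSVs. All names and data are hypothetical.

```python
import math
from itertools import permutations


def remaining(dag, root, used):
    """Unmatched nodes reachable from root without passing through a
    matched node (the root is traversed even if it is already matched)."""
    out, stack, seen = [], [root], {root}
    while stack:
        node = stack.pop()
        if node in used and node != root:
            continue  # matched node: prune the subgraph below it
        if node not in used:
            out.append(node)
        for child in dag.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return out


def best_pair(g_nodes, h_nodes, weight):
    """Minimum weight pair of a brute-force max-cardinality, min-weight matching."""
    swap = len(g_nodes) > len(h_nodes)
    small, large = (h_nodes, g_nodes) if swap else (g_nodes, h_nodes)
    best, best_cost = None, math.inf
    for perm in permutations(large, len(small)):
        pairs = [(b, a) if swap else (a, b) for a, b in zip(small, perm)]
        cost = sum(weight(u, v) for u, v in pairs)
        if best is None or cost < best_cost:
            best, best_cost = pairs, cost
    return min(best, key=lambda p: weight(*p))


def match(g, h, u, v, weight, solution):
    """Greedy best-first descent: pick the best pair, recurse into the
    subgraphs it roots, then recompute the matching on what is left."""
    while True:
        g_nodes = remaining(g, u, {a for a, _ in solution})
        h_nodes = remaining(h, v, {b for _, b in solution})
        if not g_nodes or not h_nodes:
            return solution
        pair = best_pair(g_nodes, h_nodes, weight)
        if math.isinf(weight(*pair)):
            return solution  # only incompatible pairs remain
        solution.append(pair)
        match(g, h, pair[0], pair[1], weight, solution)


g = {"a": ["b", "c"], "b": ["d"]}
h = {"x": ["y", "z"], "z": ["w"]}
sig = {"a": 3.0, "b": 1.0, "c": 0.2, "d": 0.0,
       "x": 3.0, "y": 0.4, "z": 1.1, "w": 0.05}
pairs = match(g, h, "a", "x", lambda u, v: abs(sig[u] - sig[v]), [])
```

On this toy input the roots are paired first, and every later pair is drawn from the subgraphs below an already matched pair, which is how the hierarchical constraint is respected.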



5 Demonstration 

In this section, we briefly illustrate our unified approach to indexing and matching
on two different object recognition domains.



5.1 2-D Generic Object Recognition 

To demonstrate our approach to indexing, we turn to the domain of 2-D object
recognition. We adopt a representation for 2-D shape that is based on
a coloring of the shocks (singularities) of a curve evolution process acting on
simple closed curves in the plane. Any given 2-D shape gives rise to a rooted
shock tree, in which nodes represent parts (whose labels are drawn from four
qualitatively-defined classes) and arcs represent relative time of formation (or
relative size). Figure 2 illustrates a 2-D shape, along with its corresponding shock
tree.

Fig. 2. An illustrative example taken from . The labels on the shocks of the hammer
(left) correspond to vertices in the derived shock graph (right).

We demonstrate our indexing algorithm on a database of 60 object silhouettes.
In Figure 3, query shapes are shown in the left column, followed by the top
ten database candidates (based on accumulator scores), ordered left to right. The
candidate in the box is the closest candidate, found by computing the distance
(using the matcher) between the query and each database shape (linear search).
For unoccluded shapes, the results are very encouraging, with the correct can- 
didate ranking very highly. With increasing occlusion, high indexing ambiguity 
(smaller unoccluded subtrees are less distinctive and “vote” for many objects 
that contain them) leads to slightly decreased performance, although the target 
object is still ranked highly. We are investigating the incorporation of more node 
information, as well as the encoding of subtree relations (currently, each subtree 
votes independently, with no constraints among subtrees enforced) to improve 
indexing performance. 

5.2 View-Based 3-D Object Recognition 

For our next demonstration, we turn to the domain of view-based object recognition,
in which salient blobs are detected in a multiscale wavelet decomposition
of an image. Figure 4 shows three images of an origami figure, along with
their computed multiscale blob analyses (shown inverted for improved visibility).
Each image yields a DAG, in which nodes correspond to blobs, and arcs
are directed from blobs at coarser scales to blobs at finer scales if the distance 
between their centers does not exceed the sum of their radii. Node similarity is 
a function of the difference in saliency between two blobs. 
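
The arc rule just described can be sketched as follows. This is an illustrative reading, not the authors' code: blobs are given as hypothetical `(x, y, radius, scale)` tuples, a larger `scale` value is assumed to mean a coarser blob, and an arc is added from any coarser blob to any finer one (not only between adjacent scales) when their centers are no farther apart than the sum of their radii.

```python
import math


def build_blob_dag(blobs):
    """blobs: list of (x, y, radius, scale) tuples; larger scale = coarser.
    Returns a dict mapping each blob index to the finer blobs it points to."""
    dag = {i: [] for i in range(len(blobs))}
    for i, (xi, yi, ri, si) in enumerate(blobs):
        for j, (xj, yj, rj, sj) in enumerate(blobs):
            # Arc from coarser to finer blob if the centers are close enough.
            if si > sj and math.hypot(xi - xj, yi - yj) <= ri + rj:
                dag[i].append(j)
    return dag


blobs = [
    (0.0, 0.0, 10.0, 3),   # coarse blob
    (4.0, 0.0, 3.0, 2),    # medium blob inside the coarse one
    (30.0, 0.0, 3.0, 2),   # medium blob far away
    (5.0, 1.0, 1.0, 1),    # fine blob near the first medium blob
]
dag = build_blob_dag(blobs)
```

Because arcs always point from a coarser to a strictly finer scale, the resulting graph cannot contain a cycle, so it is a DAG by construction.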

In Figure 5, we show the results of matching the first and second, and first
and third images, respectively, in Figure 4. The first and second images are taken
from similar viewpoints (approx. 15° apart), so the match yields many good
correspondences. However, for the first and third images, taken from different
viewpoints (approx. 75° apart), fewer corresponding features were found. Note
that since only the DAG structure is matched (and not DAG geometry), incorrect
correspondences may arise when nodes have similar saliency and size
but different relative positions. A stronger match can be attained by enforcing
geometric consistency.

Fig. 3. Indexing Demonstration. Shape on left is query shape, followed by top ten
candidates in decreasing score from left to right. Boxed candidate is closest to query 
(using a linear search based on matcher). 



6 Conclusions 

We have presented a unified solution to the tightly-coupled problems of index- 
ing and matching hierarchical structures. The structural properties of a DAG 
are captured by the eigenvalues of its corresponding adjacency matrix. These 
eigenvalues, in turn, can be combined to yield a low-dimensional vector repre- 
sentation of DAG structure. The resulting vectors can be used to retrieve, in the 
presence of noise and occlusion, structurally similar candidates from a database 
using efficient nearest-neighbor searching methods. Moreover, these same vectors 
contribute to the edge weights in a recursive bipartite matching formulation that 
computes an approximation to the largest isomorphic subgraph of two graphs 
(query and model) in the presence of noise and occlusion. Our formulation is general
and is applicable to the indexing and matching of any rooted hierarchical
structure, whether DAG or rooted tree. The only domain-dependent component
is the node label distance function, which is used in conjunction with the topological
distance function to compute a bipartite edge weight. We have tested the
approach extensively on the indexing and matching of shock graphs, and have
only begun to test the approach on other domains, including the preliminary
results reported in this paper in the domain of multiscale blob matching.

Fig. 4. Top row contains original images, while bottom row (shown inverted for improved
visibility) contains corresponding multiscale blob analyses (see text).

Abstract. The task of visual classification is the recognition of an object in the
image as belonging to a general class of similar objects, such as a face, a car, a 
dog, and the like. This is a fundamental and natural task for biological visual 
systems, but it has proven difficult to perform visual classification by artificial 
computer vision systems. The main reason for this difficulty is the variability of 
shape within a class: different objects vary widely in appearance, and it is 
difficult to capture the essential shape features that characterize the members of 
one category and distinguish them from another, such as dogs from cats. 

In this paper we describe an approach to classification using a fragment-based 
representation. In this approach, objects within a class are represented in terms 
of common image fragments that are used as building blocks for representing a 
large variety of different objects that belong to a common class. The fragments 
are selected from a training set of images based on a criterion of maximizing 
the mutual information of the fragments and the class they represent. For the 
purpose of classification the fragments are also organized into types, where 
each type is a collection of alternative fragments, such as different hairline or 
eye regions for face classification. During classification, the algorithm detects 
fragments of the different types, and then combines the evidence for the 
detected fragments to reach a final decision. Experiments indicate that it is 
possible to trade off the complexity of fragments with the complexity of the 
combination and decision stage, and this tradeoff is discussed. 

The method is different from previous part-based methods in using class- 
specific object fragments of varying complexity, the method of selecting 
fragments, and the organization into fragment types. Experimental results of 
detecting face and car views show that the fragment-based approach can 
generalize well to a variety of novel image views within a class while 
maintaining low mis-classification error rates. We briefly discuss relationships 
between the proposed method and properties of parts of the primate visual 
system involved in object perception. 



1 Introduction 



The general task of visual object recognition can be divided into two related, but 
somewhat different tasks - classification and identification. Classification is 
concerned with the general description of an object as belonging to a natural class of 
similar objects, such as a face or a dog. Identification is a more specific level of 
recognition, that is, the recognition of a specific individual within a class, such as the
face of a particular person, or the make of a particular car. For human vision, 
classification is a natural task: we effortlessly classify a novel object as a person, dog, 
car, house, and the like, based on its appearance. Even a three-year-old child can
easily classify a large variety of images of many natural classes. Furthermore, the 
general classification of an object as a member of a general class such as a car, for 
example, is usually easier than the identification of the specific make of the car [25]. 
In contrast, current computer vision systems can deal more successfully with the task 
of identification compared with classification. This may appear surprising, because 
specific identification requires finer distinctions between objects compared with 
general classification, and therefore the task appears to be more demanding. 

The main difficulty faced by a recognition and classification system is the problem 
of variability, and the need to generalize across variations in the appearance of objects 
belonging to the same class. Different dog images, for example, can vary widely, 
because they can represent different kinds of dogs, and for each particular dog, the 
appearance will change with the imaging conditions, such as the viewing angle, 
distance, and illumination conditions, with the animal’s posture, and so on. The 
visual system is therefore constantly faced with views that are different from all other 
views seen in the past, and it is required to generalize correctly from past experience 
and classify correctly the novel image. The variability is complex in nature: it is 
difficult to provide, for instance, a precise definition for all the allowed variations of 
dog images. The human visual system somehow learns the characteristics of the 
allowed variability from experience. This makes classification more difficult for 
artificial systems than individual identification. In performing identification of a 
specific car, say, one can supply the system with a full and exact model of the object, 
and the expected variations can be described with precision. This is the basis for 
several approaches to identification, for example, methods that use image 
combinations [29] or interpolation [22] to predict the appearance of a known object 
under given viewing conditions. In classification, the range of possible variations is 
wider, since now, in addition to variations in the viewing condition, one must also 
contend with variations in shape of different objects within the same class. 

In this paper we outline an approach to the representation of object classes, and to 
the task of visual classification, that we call a fragment-based representation. In this 
approach, images of objects within a class are represented in terms of class-specific 
fragments. These fragments provide common building blocks that can be used, in 
different combinations, to represent a large variety of different images of objects 
within the class. Following a brief review of related approaches, we discuss the 
problem of selecting a set of fragments that are best suited for representing a class of 
related objects, given a set of example images. We then illustrate the use of these 
fragments to perform classification. Finally, we conclude with some comments about 
similarities between the proposed approach and aspects of the human visual system. 



2 A Brief Review of Related Past Approaches



A variety of different approaches have been proposed in the past to deal with visual 
recognition, including the tasks of general classification and the identification of 
individual objects. We will review here briefly some of the main approaches, 
focusing in particular on methods that are applicable to classification, and that are 
related to the approach developed here. 

A popular framework for classification is based on representing object views as
points in a high-dimensional feature space, and then performing some partitioning of 
the space into regions corresponding to the different classes. Typically, a set of n 
different measurements are applied to the image, and the results constitute an n- 
dimensional vector representing the image. A variety of different measures have been 
proposed, including using the raw image as a vector of grey-level values, using global 
measures such as the overall area of the object’s image, different moments, Fourier 
coefficients describing the object’s boundary, or the results of applying selected 
templates to the image. Partitioning of the space is then performed using different 
techniques. Some of the frequently used techniques include nearest-neighbor 
classification to class representatives using, for example, vector quantization 
techniques, nearest-neighbor to a manifold representing a collection of object or class 
views [17], separating hyperplanes performed, for example, by Perceptron-type 
algorithms and their extensions [15], or, more optimally, by support vector machines 
[30]. The vector of measurements may also serve as an input to a neural network 
algorithm that is trained to produce different outputs for inputs belonging to different 
classes [21]. 

More directly related to our approach are methods that attempt to describe all 
object views belonging to the same class using a collection of some basic building 
blocks and their configuration. One well-known example is the Recognition By 
Components (RBC) method [3] and related schemes using generalized cylinders as 
building blocks [4, 12, 13]. The RBC scheme uses a small number of generic 3-D 
parts such as cubes, cones, and cylinders. Objects are described in terms of their main 
3-D parts, and the qualitative spatial relations between parts. 

Other part-based schemes have used 2-D local image features as the underlying 
building blocks. These building blocks were typically small simple image features, 
together with a description of their qualitative spatial relations. Examples of such 
features include local image patches [1, 18], corners, direct output of local receptive 
fields of the type found in primary visual cortex [7], wavelet functions [26], simple 
line or edge configurations [23], or small texture patches [32]. 

The eigenspace approach [28] can also be considered as belonging to this general 
approach. In this method, a collection of objects within a class, such as a set of faces, 
are constructed using combinations of a fixed set of building blocks. The training 
images are described as a set of grey-level vectors, and the principal components of
the training images are extracted. The principal components are then used as the
building blocks for describing new images within the class, using linear combination
of the basic images. For example, a set of ‘eigenfaces’ is extracted and used to 
represent a large space of possible faces. In this approach, the building blocks are 
global rather than local in nature. As we shall see in the next section, the building 
blocks selected by our method are intermediate in complexity: they are considerably 
more complex than simple local features used in previous approaches, but they still 
correspond to partial rather than global object views. 



3 The Selection of Class-Based Fragments 



The classification of objects using a fragment-based representation raises two main 
problems. The first is the selection of appropriate fragments to represent a given class, 
based on a set of image examples. The second is performing the actual classification 
based on the fragment representation. In this section we deal with the first of these 
problems, the selection of fragments that are well-suited for the classification task. 
Subsequent sections will then deal with the classification process. 

Our method for the selection and use of basic building blocks for classification is 
different from previous approaches in several respects. First, unlike other methods 
that use local 2-D features, we do not employ universal shape features. That is, the set 
of basic building blocks used as shape primitives is not the same for all classes, as it 
is, for instance, in the RBC approach. Instead, we use object fragments that are 
specific to a class of objects, taken directly from example views of objects in the same 
class. As a result, the shape fragments used to represent faces, for instance, would be 
different from shape fragments used to represent cars, or letters in the alphabet. 
These fragments are then used as a set of common building blocks to represent, by 
different combinations of the fragments, different objects belonging to the class. 
Second, the fragments we use as building blocks are extracted using an optimization 
process that is driven directly by requirements of the classification task. This is in 
contrast with other schemes, where the basic building elements are selected on the basis 
of other criteria, such as faithful reconstruction of the input image. Third, the 
fragments we detect are organized into equivalence sets that contain views of the 
same general region in the objects under different transformations and viewing 
conditions. As we will see later, this novel organization plays a useful role in 
performing the classification task. 

The use of the combination of image fragments to deal with intra-class variability 
is based on the notion that images of different objects within a class have a particular 
structural similarity - they can be expressed as combinations of common 
substructures. Roughly speaking, the idea is to approximate a new image of a face, 
say, by a combination of images of partial regions, such as eyes, hairline and the like 
of previously seen faces. In this section we describe briefly the process of selecting 
class-based fragments for representing a collection of images within a class. We will 
focus here mainly on computational issues; possible biological implications will be 
discussed elsewhere. In the following sections, we describe the use of the fragment 
representation for performing classification tasks. 



3.1 Selecting Informative Fragments 

Given a set of images that represent different objects from a given class, our scheme 
selects a set of fragments that are used as a basis for representing the different shapes 
within the class. Examples of fragments for the class of human faces (roughly 
frontal) and the class of cars (sedans, roughly side views) are illustrated in Figure 1. 
The fragments are selected using a criterion of maximizing the mutual information 
I(C,F) between a class C and a fragment F. This is a natural measure to employ, 
because it measures how much information is added about the class once we know 
whether the fragment F is present or absent in the image. In the ensemble of natural 
images in general, prior to the detection of any fragment, there is an a-priori 
probability p(C) for the appearance of an image of a given class C. The detection of a 
fragment F adds information and reduces the uncertainty (measured by the entropy) of 
the image. We select fragments that will increase the information regarding the 
presence of an image from the class C by as much as possible, or, equivalently, 
reduce the uncertainty by as much as possible. This depends on p(F|C), the 
probabilities of detecting the fragment F in images that come from the class C, and on 
p(F|NC) where NC is the complement of C. 

A fragment F is highly representative of the class of faces if it is likely to be found 
in the class of faces, but not in images of non-faces. This can be measured by the 
likelihood ratio p(F|C) / p(F|NC). Fragments with a high likelihood ratio are highly 
distinctive for the presence of a face. However, highly distinctive features are not 
necessarily useful fragments for face representation. The reason is that a fragment 
can be highly distinctive, but very rare. For example, a template depicting an 
individual face is highly distinctive: its presence in the image means that a face is 
virtually certain to be present in the image. However, the probability of finding this 
particular fragment in an image and using it for classification is low. On the 
other hand, a simple local feature, such as a single eyebrow, will appear in many more 
face images, but it will appear in non-face images as well. The most informative 
features are therefore fragments of intermediate size. In selecting and using optimal 
fragments for classification, we distinguish between what we call the ‘merit’ of a 
fragment and its ‘distinctiveness’. The merit is defined by the mutual information 

I(C,F) = H(C) - H(C|F) (1) 

where I is the mutual information, and H denotes entropy [6]. The merit measures 
the usefulness of a fragment F to represent a class C, and the fragments with maximal 
merit are selected as a basis for the class representation. The distinctiveness is 
defined by the likelihood ratio above, and it is used in reaching the final classification 
decision, as explained in more detail below. Both the merit and the distinctiveness can 
be evaluated given the estimated values of p(C), p(F|C), and p(F|NC). In summary, 
fragments are selected on the basis of their merit, and then used on the basis of their 
distinctiveness. 
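
The two quantities can be computed directly from p(C), p(F|C) and p(F|NC). The following is a minimal sketch for a binary class variable and a binary fragment-detection variable; the function names and the numerical values in the example are assumptions chosen for illustration, not measurements from our experiments.

```python
import math

def entropy(p):
    """Entropy (in bits) of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def merit(p_c, p_f_c, p_f_nc):
    """Mutual information I(C,F) = H(C) - H(C|F), as in Eq. (1)."""
    p_f = p_f_c * p_c + p_f_nc * (1 - p_c)       # probability of detecting F
    h_c_given_f = 0.0
    if p_f > 0:
        h_c_given_f += p_f * entropy(p_f_c * p_c / p_f)
    if p_f < 1:
        h_c_given_f += (1 - p_f) * entropy((1 - p_f_c) * p_c / (1 - p_f))
    return entropy(p_c) - h_c_given_f

def distinctiveness(p_f_c, p_f_nc):
    """Likelihood ratio p(F|C) / p(F|NC)."""
    return math.inf if p_f_nc == 0 else p_f_c / p_f_nc

# Hypothetical detection probabilities (p(F|C), p(F|NC)) for three fragments.
rare_template = (0.01, 0.0001)   # highly distinctive but rarely found
intermediate = (0.6, 0.02)
generic = (0.9, 0.5)             # common, but also found outside the class
for name, (a, b) in [("rare", rare_template), ("intermediate", intermediate),
                     ("generic", generic)]:
    print(name, round(merit(0.5, a, b), 3), round(distinctiveness(a, b), 1))
```

With these illustrative numbers the rare template has the highest distinctiveness, while the intermediate fragment has the highest merit.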

Fig. 1. Examples of face and car fragments 



Our procedure for selecting the fragments with high mutual information is the 
following. Given a set of images, such as cars, we start by comparing the images in a 
pairwise manner. The reason is that a useful building block that appears in multiple 
car images must appear, in particular, in two or more images, and therefore the 
pairwise comparison can be used as an initial filter for identifying image regions that 
are likely to serve as useful fragments. We then perform a search of the candidate 
fragments in the entire database of cars, and also in a second database composed of 
many natural images that do not contain cars. In this manner we obtain estimations 
for p(F|C) and p(F|NC) and, assuming a particular p(C), we can compute the 
fragment’s mutual information. For each fragment selected in this manner, we extend 
the search for optimal fragments by testing additional fragments centered at the same 
location, but at different sizes, to make sure that we have selected fragments of 
optimal size. The procedure is also repeated for searching an optimal resolution rather 
than size. 
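
The size search can be sketched as follows, assuming that detection counts for each candidate size have already been collected from the class database and from the non-class database. The counts, labels and function names are hypothetical.

```python
import math

def mutual_information(p_c, p_f_c, p_f_nc):
    """I(C,F) in bits for binary C and F, computed from the joint distribution."""
    joint = {
        (1, 1): p_c * p_f_c,
        (1, 0): p_c * (1 - p_f_c),
        (0, 1): (1 - p_c) * p_f_nc,
        (0, 0): (1 - p_c) * (1 - p_f_nc),
    }
    p_f = joint[(1, 1)] + joint[(0, 1)]
    marg_c = {1: p_c, 0: 1 - p_c}
    marg_f = {1: p_f, 0: 1 - p_f}
    total = 0.0
    for (c, f), p in joint.items():
        if p > 0:
            total += p * math.log2(p / (marg_c[c] * marg_f[f]))
    return total

def best_candidate(counts, n_class, n_nonclass, p_c):
    """counts maps a label to (hits in class images, hits in non-class images)."""
    scores = {}
    for name, (hits_c, hits_nc) in counts.items():
        # Detection frequencies serve as estimates of p(F|C) and p(F|NC).
        scores[name] = mutual_information(p_c, hits_c / n_class, hits_nc / n_nonclass)
    return max(scores, key=scores.get), scores

# Hypothetical counts for fragments centered at one location, at three sizes.
counts = {"small": (90, 500), "medium": (60, 20), "large": (2, 0)}
best, scores = best_candidate(counts, n_class=100, n_nonclass=1000, p_c=0.5)
```

In this toy example the medium-sized candidate obtains the highest score; the same loop could be run over resolutions instead of sizes.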

We will not describe this procedure in further detail, except to note that large 
fragments of reduced resolution are also highly informative. For example, a full-face 
fragment at high resolution is non-optimal because the probability of finding this 
exact high-resolution fragment in the image is low. However, at a reduced resolution, 
the merit of this fragment is increased up to an optimal value, at which it starts to 
decrease. In our representation we use fragments of intermediate complexity in either 
size or resolution, and it includes full resolution fragments of intermediate size, and 
larger fragments of intermediate resolution. 



4 Performing Classification 

The set of fragments extracted from the training images during the learning stage are 
then used to classify new images. In this section we consider the problem of 
performing object classification based on the fragment representation described 
above. 

In performing classification, the task is to assign the image to one of a known set 
of classes (or decide that the image does not depict any known class). In the 
following discussion, we consider a single class, such as a face or a car, and the task 
is to decide whether or not the input image belongs to this class. This binary decision 
can also be extended to deal with multiple classes. We do not assume that the image 
contains a single object at a precisely known position; consequently, the task includes 
a search over a region in the image. We can therefore view the classification task also 
as a detection task, that is, deciding whether the input image contains a face, and 
locating the position of the face if it is detected in the image. 

The algorithm consists of two main stages: detecting fragments in the image, 
followed by a decision stage that combines the results of the individual fragment 
detectors. In the first stage, fragments are detected by comparing the image at each 
location with stored fragment views. The comparison is based on a gray-level 
similarity measure that is insensitive to small geometric distortions and gray-level variations. 
As for the combination stage, we have compared two different approaches: a simple 
and a complex scheme. The simple scheme essentially tests that a sufficient number 
of basic fragments have been detected. The more complex scheme is based on a more 
complete probability distribution of the fragments, and takes into account 
dependencies between pairs of fragments. We found that, using the fragments 
extracted based on informativeness, the simple scheme is powerful and works almost 
as well as the more elaborate scheme. This finding is discussed later in this section. 

In the following sections we describe the algorithm in more detail. We begin by 
describing the similarity measure used for the detection of the basic fragments. 



5 Detecting Individual Fragments 



The detection of the individual fragment is based on a direct gray-level comparison 
between stored fragments and the input image. To allow some distortions and scale 
changes of a fragment, the comparison is performed first on smaller parts of the 
fragments, which were taken in the implementation to be patches of 5x5 pixels. We 
describe first the gray-level similarity measure used in the comparison, and then how 
the comparisons of the small patches are used to detect individual fragments. 

We have evaluated several gray-level comparison methods, both known and new, 
to measure similarity between gray level patches in the stored fragment views and 
patches in the input image. Many of the comparison methods we tested gave 
satisfactory results within the subsequent classification algorithm, but we found that a 
method that combined the qualitative image-based representation suggested by Bhat and 
Nayar [2] with gradient and orientation measures gave the best results. The method 
measured the qualitative shape similarity using the ordinal ranking of the pixels in the 
regions, and also measured the orientation difference using gradient amplitude and 
direction. For the qualitative shape comparison we computed the ordinal ranking of 
the pixels in the two regions, and used the normalized sum of displacements of the 
pixels with the same ordinal ranking as the measure of the similarity of the regions. 

The similarity measure D(F,H) between an image patch H and a fragment patch F 
is a weighted sum of the sum of these displacements $d_i$, the absolute orientation 
difference of the gradients $|\alpha_F - \alpha_H|$, and the absolute gradient difference 
$|G_F - G_H|$: 

$D(F,H) = k_1 \sum_i d_i + k_2 |\alpha_F - \alpha_H| + k_3 |G_F - G_H|$ (2) 

This measure appears to be successful because it is mainly sensitive to the local 
structure of the patches and less to absolute intensity values. 
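
A measure of the form of Eq. (2) can be sketched as follows. The sketch assumes NumPy; the use of the patch's mean gradient for the amplitude and direction terms, the normalization by the number of pixels, and the unit weights are assumptions made for illustration and are not taken from the implementation. Lower values of D indicate more similar patches.

```python
import numpy as np

def rank_positions(patch):
    """Pixel coordinates sorted by grey level (ties broken by scan order)."""
    order = np.argsort(patch, axis=None, kind="stable")
    return np.column_stack(np.unravel_index(order, patch.shape))

def ordinal_displacement(f, h):
    """Mean distance between pixels holding the same ordinal rank in f and h."""
    diff = rank_positions(f) - rank_positions(h)
    return float(np.linalg.norm(diff, axis=1).mean())

def mean_gradient(patch):
    """Amplitude and direction of the patch's average gradient (an assumption)."""
    gy, gx = np.gradient(patch.astype(float))
    mx, my = gx.mean(), gy.mean()
    return float(np.hypot(mx, my)), float(np.arctan2(my, mx))

def dissimilarity(f, h, k1=1.0, k2=1.0, k3=1.0):
    """Weighted sum of the three terms of Eq. (2); k1, k2, k3 are hypothetical."""
    d = ordinal_displacement(f, h)
    g_f, a_f = mean_gradient(f)
    g_h, a_h = mean_gradient(h)
    # Wrap the angle difference into [0, pi].
    da = abs((a_f - a_h + np.pi) % (2 * np.pi) - np.pi)
    return k1 * d + k2 * da + k3 * abs(g_f - g_h)

patch = np.arange(25).reshape(5, 5)
brighter = 2 * patch + 10     # same structure, different contrast and offset
```

With k3 set to zero, the patch and its brighter copy are at distance zero, because the ordinal ranking and the gradient direction ignore monotonic grey-level changes.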

For the detection of fragments in the images we first compared local 5x5 gray level 
patches in each fragment to the image, using the above similarity measure. Only 
regions with sufficient variability were compared, since in flat-intensity regions the 
gradient, orientation and ordinal-order have little meaning. We allowed flexibility in 
the comparison of the fragment view to the image by matching each pixel in the 
fragment view to the best pixel in some neighborhood around its corresponding 
location. Most of the computations of the entire algorithm are performed at this stage. 
To detect objects at different scales in the image, the algorithm is performed on the 
image at several scales. Each level detects objects at scale differences of ±35%. The 
combination of several scales enables the detection of objects under considerable 
changes in their size. 



6 Combining the Fragments and Making a Decision 



Following the detection of the individual fragments, we have a set of ‘active’ 
fragments, that is, fragments that have been detected in the image. We next need to 
combine the evidence from these fragments and reach a final decision as to whether 
the class of interest is present in the image. 

In this section we will consider two alternative methods of combining the evidence 
from the individual fragments and reaching a decision. Previous approaches to visual 
recognition suggest there is a natural trade-off between the use of simple visual 
features that require a complex combination scheme, and the use of more complex 
features, but with a simpler combination scheme. 

A number of recognition and classification schemes have used simple local image 
features such as short oriented lines, corners, Gabor or wavelet basis functions, or 
local texture patches. Such features are generic in nature, that is, common to all 
visual classes. Consequently, the combination scheme must rely not only on the 
presence in the image of particular features, but also on their configurations, for 
example, their spatial relations, and pair-wise or higher statistical interdependencies 
between the features. A number of schemes using this approach [1, 14, 26, 33] 
therefore employ, in addition to the detection of the basic features, additional 
positional information, and probability distribution models of the features. In contrast, 
a classifier that uses more complex, class-specific visual features could employ a 
simpler combination scheme because the features themselves already provide good 
evidence about the presence of the class in question. In the next section, we first 
formulate the classification as a problem of reaching an optimal decision using 
probabilistic evidence. We then compare experimentally the classification 
performance of a fragment-based classifier that uses a ‘simple’ combination method, 
and one that uses a more complex scheme. 



6.1 Probability Distribution Models 

We can consider in general terms the problem of reaching a decision about the 
presence of a class in the images based on some set of measurements denoted by X. 
Under general conditions, the optimal decision is obtained by evaluating the 
likelihood ratio defined as: 

$P(X|C_1) / P(X|C_0)$, 

where $P(X|C_0)$ and $P(X|C_1)$ are the conditional probabilities of X within 
and outside the class. The elements of X express, in the fragment-based scheme, the 
fragments that have been detected in the image. 

The direct use of this likelihood ratio in practice raises computational problems in 
learning and storing the probability functions involved. A common solution is to use 
restricted models using assumptions about the underlying probability distribution of 
the feature vector. In such models, the number of parameters used to encode the 
probability distribution is considerably smaller than for a complete look-up table 
representation, and these parameters can be estimated with a higher level of 
confidence. 

A popular and useful method for generating a compact representation of a 
probability distribution is the use of Belief-Networks, or Bayesian-Networks. A 
Bayesian-Network is a directed graph where each node represents one of the variables 
used in the decision process. The directed edges correspond to dependency 
relationships between the variables, and the parameters are conditional probabilities 
between inter-connected nodes. A detailed description of Bayesian-Networks can be 
found in [19]. The popularity of Bayesian-Network methods is due in part to their 
flexibility and ability to represent probability distributions with dependencies between 
the variables. There are several methods that enable the construction of a network 
representation from training data, and algorithms that efficiently compute the 
probability of all the variables in the network given the observed values of some of 
the variables. In the following section, we use this formalism to compare two 
methods for combining the evidence from the fragments detected in the first stage. 
One is a simple scheme sometimes called the ‘naive Bayesian’ method, and the second is a 
more elaborate scheme using a Bayesian-Network type method. 

6.2 Naive-Bayes 

The assumption underlying the naive-Bayes classifier is that the entries of the feature 
vector can be considered independent when computing the likelihood ratio 
$P(X|C_1) / P(X|C_0)$. In this case, the class-conditional probabilities can be expressed by 
the product: 

$P(X_1, \ldots, X_N | C) = \prod_{i=1}^{N} P(X_i | C)$ (3) 

The values of the single probabilities are directly measured from the training data. 
In practice, this means that we first measure the probability of each fragment $X_i$ within 
and outside the class. To reach a decision we simply multiply the relevant 
probabilities together. This method essentially assumes independence between the 
different fragments. (More precisely, it assumes conditional independence.) The 
actual computation in our classification scheme was performed using the fragment 
types, rather than the fragments themselves. This means that for each fragment type 
(such as a hairline or eye region), the best-matching fragment was selected. The 
combination then proceeds as above, by multiplying the probabilities of these 
fragments, one from each type. 
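
The combination rule can be sketched as follows. The sketch works with logarithms, so that the product of Eq. (3) becomes a sum; the probability values, the threshold, and the function names are hypothetical.

```python
import math

def log_likelihood_ratio(detected, p_in, p_out):
    """Sum of per-fragment log ratios under conditional independence.

    detected: list of 0/1 flags, one per fragment type.
    p_in[i], p_out[i]: estimated detection probabilities of fragment type i
    within and outside the class (assumed strictly between 0 and 1).
    """
    total = 0.0
    for x, pi, po in zip(detected, p_in, p_out):
        if x:
            total += math.log(pi / po)
        else:
            total += math.log((1 - pi) / (1 - po))
    return total

def classify(detected, p_in, p_out, threshold=0.0):
    """Decide 'class present' when the log ratio exceeds the threshold."""
    return log_likelihood_ratio(detected, p_in, p_out) > threshold

# Hypothetical estimates for three fragment types.
p_in = [0.8, 0.7, 0.6]
p_out = [0.1, 0.2, 0.1]
print(classify([1, 0, 1], p_in, p_out))
```

Varying the threshold trades hits against false alarms.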

6.3 Dependence-Tree Combination 

The Dependence-tree model is a simple Bayesian-Network that describes a 
probability distribution which incorporates some relevant pairwise dependencies 
between variables, unlike the independence assumptions used in the naive-Bayes 
scheme. The features are organized into a tree structure that represents statistical 
dependencies in a manner that allows an efficient computation of the probability of an 
input vector. The tree structure permits the use of some, but not all, of the 
dependencies between features. An optimal tree representation can be constructed 
from information regarding pairwise dependencies in the data [5]. The probability of 
an input vector is computed by multiplying together the probabilities of each node 
given the value of its parent. More formally: 

$P(X_1, \ldots, X_N | C) = P(X_1 | C) \prod_{i=2}^{N} P(X_i | X_{j(i)}, C)$ (4) 

where j(i) represents the parent of node i. $X_1$ is the root of the tree (which 
represents the class variable), and does not have a parent. The conditional probabilities 
are estimated during the learning phase directly from the training data. 
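
Evaluating Eq. (4) for binary features can be sketched as follows. The tree structure and the probability tables are hypothetical, and the sketch covers only the evaluation of a given tree, not its construction from pairwise dependencies [5].

```python
import itertools

def tree_probability(x, parent, root_table, cond_tables):
    """P(x_1, ..., x_N | C) for binary features arranged in a dependence tree.

    x: list of 0/1 values; index 0 is the root.
    parent[i]: index of the parent of node i (unused for the root).
    root_table[v]: P(X_root = v | C).
    cond_tables[i][u][v]: P(X_i = v | X_parent = u, C).
    """
    prob = root_table[x[0]]
    for i in range(1, len(x)):
        prob *= cond_tables[i][x[parent[i]]][x[i]]
    return prob

# Hypothetical three-node chain: node 1 depends on node 0, node 2 on node 1.
parent = [None, 0, 1]
root_table = [0.3, 0.7]
cond_tables = [
    None,
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.5, 0.5]],
]
p_all_present = tree_probability([1, 1, 1], parent, root_table, cond_tables)
# The probabilities of all configurations should sum to one.
total = sum(tree_probability(list(x), parent, root_table, cond_tables)
            for x in itertools.product([0, 1], repeat=3))
```

The cost of one evaluation is linear in the number of features, as for the naive-Bayes product.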

6.4 The Trade-Off between Feature Complexity and Combination Complexity 

We have implemented and compared the two schemes outlined above. This allows us 
to compare a simple combination scheme that is based primarily on the presence or 
absence of fragments in the image, and a more elaborate scheme that uses a more 
refined model of the probability distribution of the fragments. Figure 2 shows the 
performance of a fragment-based classifier trained to detect side views of cars in low- 
resolution images, using both combination schemes. 

The results are presented in the form of Receiver Operating Characteristic (ROC) 
curves [9]. Each point in an ROC curve represents a specific pair of false-alarm rate and 
hit rate of the classifier, for a given likelihood ratio threshold. The efficiency of a 
classifier can be evaluated by the ‘height’ of its ROC curve: for a given false-alarm 
rate, the better classifier will be the one with higher hit probability. The overall 
performance of a classifier can be measured by the area under the performance curve 
in the ROC graph. 
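
The construction of an ROC curve and of the area under it can be sketched as follows. The scores and labels are toy values, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(false-alarm rate, hit rate) pairs obtained by sweeping the threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        hits = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((false_alarms / neg, hits / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy likelihood-ratio scores; label 1 marks images that contain the class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

An area of 1 corresponds to a classifier that ranks every class image above every non-class image.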

The curves for both methods are almost identical, showing that including pairwise 
dependencies in the combination scheme, rather than using the information of each 
feature individually, has a marginal effect on the classifier performance. This suggests 
that most of the useful information for classification is encoded in the image 
fragments themselves, rather than their inter-dependencies. This property of the 
classifier depends on the features used for classification. When simple generic 
features are used, the dependencies between features at different locations play an 
important role in the classification process. However, when more complex features 
are used, such as the ones selected by our information criterion, then a simpler 
combination scheme will suffice. 




[Figure: ROC plot; horizontal axis: probability of false alarm] 



Fig. 2. Receiver Operating Characteristic curves for both classifiers. NB: Naive-Bayes 
classifier. DT: Dependence-Tree classifier. 

7 Experimental Results 



We have tested our algorithm on face and car views. A number of similar experiments 
were performed, using somewhat different databases and different details of the 
fragment extraction and combination procedures. In an example experiment, we 
tested face detection, using a set of 1104 part views, taken from a set of 23 male face 
views under three illuminations and three horizontal rotations. The parts were 
grouped into eight fragment types - eye pair, nose, mouth, forehead, low-resolution 
view, mouth and chin, single eye and face outline. For cars, we used 153 parts of 6 
types. Figure 3 shows several examples. Note that although the system used only male 
views in a few illuminations and rotations, it detects male and female face views under 
various viewing conditions. 

The results of applying the method to these and other images indicate that the 
fragment-based representation generalizes well to novel objects within the class of 
interest. Using face fragments obtained from a small set of examples it was possible 
to classify correctly diverse images of males and females, in both real images and 
drawings, that are very different from the faces in the original training set. This was 
achieved while maintaining low false alarm rates on images that did not contain faces. 
The use of a modest number of informative fragments in different combinations appears 
to have an inherent capability to deal with shape variability within the class. The 
fragment-based scheme was also capable of obtaining significant position invariance, 
without using explicit representation of the spatial relationships between fragments. 
The insensitivity to position as well as to other viewing parameters was obtained 
primarily by the use of a redundant set of overlapping fragments, including fragments 
of intermediate size and higher resolution, and fragments of larger size and lower 
resolution. 



8 Some Analogies with the Human Visual System 



In visual areas of the primate cortex, neurons respond optimally to increasingly 
complex features in the input. Simple and complex cells in the primary visual area 
(V1) respond best to a line or edge at a particular orientation and location in the 
visual field [10]. In higher-order visual areas of the cortex, units were found to 
respond to increasingly complex local patterns. For example, V2 units respond to 
collinear arrangements of features [31], some V4 units respond to spiral, polar and 
other local shapes [8], TE units respond to moderately complex features that may 
resemble e.g. a lip or an eyebrow [27], and anterior IT units often respond to complete 
or partial object views [11, 24]. Together with the increase in the complexity of their 
preferred stimuli, units in higher order visual areas also show increased invariance to 
viewing parameters, such as position in the visual field, rotation in the image plane, 
rotation in space, and some changes in illumination [11, 20, 24, 27]. 

Fig. 3. Examples of face and car detection. The images are from the Weizmann image 
database, from the CMU face detector gallery, and the last two from www.motorcities.com. 

The preferred stimuli of IT units are highly dependent upon the visual experience of 
the animal. In monkeys trained to recognize different wire objects, units were found 
that respond to specific full or partial views of such objects [11]. In animals trained 
with fractal-like images, units were subsequently found that respond to one or more of 
the images in the training set [16]. 

These findings are consistent with the view that the visual system uses object 
representations based on class related fragments of intermediate complexity, 
constructed hierarchically. The preferred stimuli of simple and intermediate 
complexity neurons in the visual system are specific 2-D patterns. Some binocular 
information can also influence the response, but this additional information, which 
adds 3-D information associated with a fragment under particular viewing conditions, 
can be incorporated in the fragment based representation. The lower level features are 
simple generic features. The preferred stimuli of the more complex units are 
dependent upon the family of training stimuli, and appear to be class-dependent rather 
than, for example, a small set of universal building blocks. Invariance to viewing 
parameters such as position in the visual field or spatial orientation appears gradually, 
possibly by the convergence of more elementary and less invariant fragments onto 
higher order units. From this theory we can anticipate the existence of two types of 
intermediate complexity units that have not been reported so far. First, for the 
purpose of classification, we expect to find units that respond to different types of 
partial views. As an example, a unit of this kind may respond to different shapes of 
hairline, but not to mouth or nose regions. Second, because the invariance of 
complex shapes to different viewing parameters is inherited from the invariance of the 
more elementary fragments, we expect to find intermediate complexity units, 
responding to partial object views at a number of different spatial orientations and 
perhaps different illumination conditions. 

Abstract. The paper introduces a new approximation scheme for 
planar digital curves. This scheme defines an approximating sausage 
‘around’ the given digital curve, and calculates a minimum-length 
polygon in this approximating sausage. The length of this polygon is 
taken as an estimator for the length of the curve being the (unknown) 
preimage of the given digital curve. Assuming finer and finer grid 
resolution it is shown that this estimator converges to the true perimeter 
of an r-compact polygonal convex bounded set. This theorem provides 
theoretical evidence for practical convergence of the proposed method 
towards a ‘correct’ estimation of the length of a curve. The validity of 
the scheme has been verified through experiments on various convex 
and non-convex curves. Experimental comparisons with two existing 
schemes have also been made. 

Keywords: Digital geometry, digital curves, multigrid convergence, 
length estimation. 



1 Introduction and Preliminary Definitions 

Approximating planar digital curves is one of the most important topics in im- 
age analysis. An approximation scheme is required to ensure convergence of esti- 
mated values such as curve length toward the true length assuming a digitization 
model and an increase in grid resolution. For example, the digital straight segment 
approximation method (DSS method), see [3 IS] , and the minimum length poly- 
gon approximation method assuming one-dimensional grid continua as boundary 
sequences (MLP method), see 0, are methods for which there are convergence 
theorems when specific convex sets are assumed to be the given input data, see 
EECni. This paper studies the convergence properties of a new minimum length 
polygon approximation method based on so-called approximation sausages (AS- 
MLP method). 

Motivations for studying this new technique are as follows: the resulting DSS 
approximation polygon depends upon the starting point and the orientation of the 
boundary scan, so it is not uniquely defined, but it may be calculated for any 
given digital object. The MLP approximation polygon is uniquely defined, but 
it assumes a one-dimensional grid continuum as input, which is only possible if 
the given digital object does not have cavities of width 1 or 2. The new method 
leads to a uniquely defined polygon, and it may be calculated for any given 
digital object. 

Let r be the grid resolution defined as being the number of grid points per 
unit. We consider r-grid points $g^r_{i,j} = (i/r, j/r)$ in the Euclidean plane, for 
integers i, j. Any r-grid point is assumed to be the center point of an r-square with 
r-edges of length 1/r parallel to the coordinate axes, and r-vertices. 

The digitization model for our new approximation method is just the same 
as that considered in the case of the DSS method, see [4, 5, 6]. That is, let S be a 
set in the Euclidean plane, called the real preimage. The set $C_r(S)$ is the union of 
all those r-squares whose center point $g^r_{i,j}$ is in S. The boundary $\partial C_r(S)$ is the 
r-frontier of S. Note that $\partial C_r(S)$ may consist of several non-connected curves 
even in the case of a bounded convex set S. A set S is r-compact iff there is a 
number $r_S > 0$ such that $\partial C_r(S)$ is just one (connected) curve, for any $r > r_S$. 
This definition of r-compactness has been introduced in p[j in the context of 
showing multigrid convergence of the DSS method. 
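
The digitization model can be illustrated by a minimal sketch in which the real preimage S is taken to be the closed unit disc (an assumption made for the example; the function names are hypothetical). The sketch collects the r-grid points lying in S and measures the total length of the r-frontier.

```python
import math

def digitize_disc(r):
    """Indices (i, j) of the r-grid points (i/r, j/r) inside the closed unit disc."""
    return {(i, j)
            for i in range(-r, r + 1)
            for j in range(-r, r + 1)
            if i * i + j * j <= r * r}

def frontier_length(cells, r):
    """Total length of the r-edges separating the digitized set from its complement."""
    edges = 0
    for (i, j) in cells:
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (i + di, j + dj) not in cells:
                edges += 1
    return edges / r      # every r-edge has length 1/r

for r in (10, 100):
    print(r, frontier_length(digitize_disc(r), r), 2 * math.pi)
```

For this preimage the length of the r-frontier itself approaches 8 rather than the true perimeter of about 6.283 as r grows, which illustrates why a length estimate has to be derived from an approximating polygon rather than from the staircase boundary.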

The validity of the proposed scheme has been verified through experiments 
on various curves, which are described in Section 0 It has also been compared 
with the existing schemes in convergence and computation time. 

2 Approximation Scheme 

Given a connected region S in the Euclidean plane and a grid resolution r, the 
r-frontier of S is uniquely determined. We consider r-compact sets S, and grid 
resolutions $r > r_S$ for such a set, i.e. $\partial C_r(S)$ is just one (connected) curve. In such 
a case the r-frontier of S can be represented in the form $P = (v_0, v_1, \ldots, v_{n-1})$ 
in which the vertices are clockwise ordered so that the interior of S lies to the 
right of the boundary. Note that all arithmetic on vertex indices is modulo n. 

Let $\delta$ be a real number between 0 and 1/(2r). For each vertex of P we define 
forward and backward shifts: the forward shift $f(v_i)$ of $v_i$ is the point on the 
edge $(v_i, v_{i+1})$ at distance $\delta$ from $v_i$. The backward shift $b(v_i)$ is the point on the 
edge $(v_{i-1}, v_i)$ at distance $\delta$ from $v_i$. 
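
The two shifts can be computed as in the following minimal sketch. Vertices are given as coordinate pairs in clockwise order, vertex indices are taken modulo n, and the function names are hypothetical.

```python
import math

def shift_towards(p, q, delta):
    """Point on the segment (p, q) at distance delta from p."""
    t = delta / math.dist(p, q)
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

def forward_shift(vertices, i, delta):
    """Shift of vertex i along the edge towards vertex i + 1."""
    n = len(vertices)
    return shift_towards(vertices[i], vertices[(i + 1) % n], delta)

def backward_shift(vertices, i, delta):
    """Shift of vertex i along the edge towards vertex i - 1."""
    n = len(vertices)
    return shift_towards(vertices[i], vertices[(i - 1) % n], delta)

# A single r-square for r = 1, traversed clockwise; delta must not exceed 1/(2r).
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
delta = 0.25
f0 = forward_shift(square, 0, delta)
b0 = backward_shift(square, 0, delta)
```

For the first vertex of this square the forward shift lies on the edge leading to (0, 1) and the backward shift on the edge coming from (1, 0).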

For example, in the approximation scheme as detailed below we will replace 
an edge $(v_i, v_{i+1})$ by a line segment $(v_i, f(v_{i+1}))$ interconnecting $v_i$ and the 
forward shift of $v_{i+1}$, which is referred to as the forward approximating segment 
and denoted by $L_f(v_i)$. The backward approximating segment $(v_i, b(v_{i-1}))$ is 
defined similarly and denoted by $L_b(v_i)$. Refer to Fig. 1 for illustration. Now we 
have three sets of edges: original edges of the r-frontier, and forward and backward 
approximating segments. Let $0 < \delta \le 0.5/r$. Based on these edges we define a 
connected region $A_r^\delta(S)$, which is homeomorphic to the annulus, as follows: 

Let P be a polygonal circuit describing an r-frontier in clockwise orientation. 
By reversing P we obtain a polygonal circuit Q in counterclockwise order. In 
the initialization step of our approximation procedure we consider P and Q as 
the external and internal bounding polygons of a polygon $P_B$ homeomorphic to 
the annulus. It follows that this initial polygon $P_B$ has area contents zero, and 
as a set of points it coincides with $\partial C_r(S)$. 

Fig. 1. Definition of the forward and backward approximating segments associated 
with a vertex $v_i$. 

Now we ‘move’ the external polygon P ‘away’ from $C_r(S)$, and the internal 
polygon Q ‘into’ $C_r(S)$ as specified below. This process will expand $P_B$ step 
by step into a final polygon which contains $\partial C_r(S)$, and where the Hausdorff 
distance between P and Q becomes non-zero. For this purpose, we add forward 
and backward approximating segments to P and Q in order to increase the area 
contents of the polygon $P_B$. 

To be precise, for any forward or backward approximating segment $L_f(v_i)$ or 
$L_b(v_i)$ we first remove the part lying in the interior of the current polygon $P_B$ 
and then update the polygon $P_B$ by adding the remaining part of the segment as a 
new boundary edge. The direction of the edge is determined so that the interior 
of $P_B$ lies to the right of it. 

Definition 1. The resulting polygon $P_B$ is referred to as the approximating 
sausage of the r-frontier and denoted by $A_r^\delta(S)$. 

The width of such an approximating sausage depends on the value of $\delta$. It is 
easy to see that as long as the value of $\delta$ is at most half of the grid size, i.e., 
less than or equal to 1/(2r), the approximating sausage is well defined, that is, it 
has no self-intersection. It is also immediately clear from the definition that the 
Hausdorff distance from the r-frontier $\partial C_r(S)$ to the boundary of the sausage 
$A_r^\delta(S)$ is at most $\delta \le 1/(2r)$. 

We are ready to define the final step in our AS-MLP approximation scheme 
for estimating the length of a digital curve. Our method is similar to that of the 
MLP as introduced in ^j. 

Definition 2. Assume a region S having a connected r-frontier. An AS-MLP 
curve for approximating the boundary of S is defined as being a shortest closed 
curve $\gamma_r^\delta(S)$ lying entirely in the interior of the approximating sausage $A_r^\delta(S)$, 
and encircling the internal boundary of $A_r^\delta(S)$. 

It follows that such an AS-MLP curve $\gamma_r^\delta(S)$ is uniquely defined, and that it is a 
polygonal curve defined by finitely many straight segments. Note that this curve 
depends upon the choice of the approximation constant $\delta$. An example of such 
a shortest closed curve $\gamma_r^\delta(S)$ is given in Fig. 2 with $\delta = .5/r$. 
Fig. 2. Left: construction of approximating sausage. Right: approximation by shortest 
internal path. 



3 Properties of the Digital Curve 

We discuss some of the properties of the approximating polygonal curve 
$\gamma_r^\delta(S)$ defined above, assuming that $\partial C_r(S)$ is a single connected curve. 

Non-selfintersection: The AS-MLP curve $\gamma_r^\delta(S)$ is defined as being a shortest 
closed curve lying in the approximating sausage. Since it is obvious from the 
definition that the sausage has no self-intersection, neither does the curve. 

Controllability: The width of an approximating sausage can be controlled by 
selecting a value of $\delta$, with $0 < \delta \le .5/r$. 

Smoothness: Compared with the other two approximation schemata DSS and 
MLP, our approximating curve is ‘more smooth’ in the following sense: the 
angle associated with a corner of an approximating polygon is the smaller one 
of its internal angle and external angle. We consider the minimum angle of all 
these angles associated with a corner of the AS-MLP curve. Similarly, such 
minimum angles may be defined for approximating DSS and MLP curves. 
It holds that the minimum AS-MLP angle is always greater than or equal 
to the minimum DSS or minimum MLP angle, if a convex set S has been 
digitized. Note that ‘no small angle’ means ‘no sharp corner’. 

Linear complexity: Due to the definition of our curve $\gamma_r^\delta(S)$ the number of 
its vertices is at most twice that of the r-frontier. 

Computational complexity: A shortest closed path can be found in linear 
time. First, an approximating sausage can be triangulated in linear time, since 
the vertices of the sausage can be calculated using only nearby segments. Then, 
we can construct an adjacency graph, which is a tree, representing adjacency 
of triangles, again in linear time. Finally, we can find a shortest path in linear 
time by slightly modifying the linear-time algorithm for finding a shortest path 
within a simple polygon. 

Figure 3 gives visual comparisons of the proposed AS-MLP method with two 
existing schemes DSS and MLP. 
Fig. 3. Original region with DSS (left), MLP (center), and proposed approximation 
using $\delta = .5/r$ (right). 



4 Convergence Theorem 

In this section we prove the main result of this paper about the multigrid 
convergence of the AS-MLP curve based length estimation of the perimeter of a 
given set S. 

Theorem 1. The length of the approximating polygonal curve $\gamma_r^\delta(S)$ converges 
to the perimeter of a given region S if S is an r-compact polygonal convex bounded 
set and $0 < \delta \le .5/r$. 

We sketch a proof of this theorem with an investigation of geometric properties 
of the r-frontier of a convex polygonal region S. 

We first classify r-grid points into interior and exterior ones depending on 
whether they are located inside of the region S or not. Then, $CH_{in}$ is defined 
to be the convex hull of the set of all interior r-grid points. $CH_{out}$ is the convex 
hull of the set of those exterior r-grid points adjacent horizontally or vertically 
to interior ones. See Fig. 4 for illustration. 

Lemma 1. The difference between the lengths of $CH_{in}$ and $CH_{out}$ is exactly 
$4\sqrt{2}/r$. 




Fig. 4. Interior r-grid points (filled circles) and exterior points (empty circles) with 
the convex hulls $CH_{in}$ of a set of interior points and $CH_{out}$ of a set of exterior points 
adjacent to interior ones. 

Now, we are ready to state the following lemma which is of crucial importance 
for proving the convergence theorem. 

Lemma 2. Given an r-compact polygonal convex bounded set S, the approximating 
polygonal curve $\gamma_r^\delta(S)$ is contained in the region bounded by $CH_{in}$ and 
$CH_{out}$, for $0 < \delta \le .5/r$. 

Let CH be the convex hull of the set of vertices of the approximating polygonal 
curve $\gamma_r^\delta(S)$. The convex hull CH is also bounded by $CH_{in}$ and $CH_{out}$. 
Obviously, the vertices of CH are all intersections of approximating segments. 
Furthermore, exterior intersections do not contribute to CH, where external 
(internal, resp.) intersections are those on the external (internal, resp.) boundary 
of the approximating sausage. Therefore, we can evaluate the perimeter of CH. 
An increase in distance of an internal intersection from the boundary of $CH_{in}$ 
corresponds to an increase in length of an approximating segment, and a decrease 
of distance of its associated intersection to the inner hull $CH_{in}$. Thus, 
such an intersection is farthest at a corner defined by two unit edges. Thus, the 
maximum distance from CHin to CH is bounded by which implies that 

the perimeter of CiL is bounded by ^ 121 : jZ. 

Lemma 3. Let CH be the convex hull of all internal intersections defined above. 
Then, the approximating polygonal curve $\gamma_r^\delta(S)$ lies between the two convex hulls 
$CH_{in}$ and CH. The maximum gap between $CH_{in}$ and CH is bounded by 
and for their perimeter we have 

$\mathrm{Perimeter}(CH) < \mathrm{Perimeter}(CH_{in}) + 4\sqrt{2}/r. \qquad (1)$ 

So, if the approximating polygonal curve $\gamma_r^\delta(S)$ is convex, then we are done. 
Unfortunately, it is not always convex. In the remaining part of this section we 
evaluate the largest possible difference in length between $\gamma_r^\delta(S)$ and CH. 

Lemma 4. The approximating polygonal curve $\gamma_r^\delta(S)$ is concave when two 
consecutive long edges of lengths $d_{i-1}$ and $d_i$ with intervening unit edge satisfy 
$d_i > 3d_{i-1} + 1$. 

By analysis of the possible differences from the convex chain, we obtain the 
following theorem. 

Theorem 2. Let S be a bounded, convex polygonal set. Then, there exists a grid 
resolution $r_0$ such that for all $r > r_0$ it holds that any AS-MLP approximation of 
the r-frontier $\partial C_r(S)$, with $0 < \delta \le .5/r$, is a connected polygon with a perimeter 
$l_r$ and 

$|\mathrm{Perimeter}(S) - l_r| < (4\sqrt{2} + 8 \cdot 0.0234)/r = 5.844/r. \qquad (2)$ 
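
The constant in the bound of Theorem 2 is easy to check numerically. The short Python sketch below evaluates it and tabulates the bound for a few grid resolutions; the function name is an assumption made for this illustration, and the resolutions are simply sample values.

```python
import math

def as_mlp_error_bound(r):
    """Upper bound on |Perimeter(S) - l_r| from Theorem 2, for grid resolution r."""
    constant = 4 * math.sqrt(2) + 8 * 0.0234
    return constant / r

# The bound shrinks linearly in 1/r.
for r in (32, 128, 1024):
    print(r, as_mlp_error_bound(r))
```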

5 Experimental Evaluation 

We have seen above that the perimeter estimation error by AS-MLP is bounded 
in theory by 5.8/r for a grid resolution r, for convex polygons. To illustrate 
its practical behavior we report on experiments on various curves, which are 
described below. Although we have restricted ourselves to convex objects in the 
preceding proof, we took non-convex curves as well in these experiments. Figure 
5 illustrates a set of objects used for experiments as suggested in [?]. 

Fig. 5. Experimental objects. 

CIRCLE: the equation of the circle is 

$(x - 0.5)^2 + (y - 0.5)^2 = 0.4^2$ 

YINYANG: the lower part of the yin-yang symbol is composed of arcs of 3 
half circles: the lower arc is a part of CIRCLE, and the upper arcs are parts of 
circles whose sizes are half of CIRCLE. 

LUNULE: this object is the remainder of two circles, where the distance 
between both center points is 0.28. 

SINC: the sine equation corresponding to the upper curve is 




SQUARE: the edges of the isothetic SQUARE are of length 0.8. 



5.1 Two Existing Approximation Schemes 

We sketch both existing schemes which are used for comparisons, where the DSS 
and MLP implementation reported in [?] has been used for experimental evaluation. 
First, the digital straight segment (DSS) algorithm traces an r-frontier, i.e. 
vertices and edges on $\partial C(S)$, i.e. a boundary of $C(S)$, and detects a consecutive 
sequence of maximum length DSSs. The sum of the lengths of these DSSs is used 
as DSS curve length estimator. The DSS algorithm runs in linear time. 

The minimum-length polygon (MLP) approximation needs two boundaries, 
of set I(S) and of set O(S), as input. Roughly speaking, I(S) is the union of 
r-squares that are entirely included in S; in other words, all four r-vertices of such 
a square are included in a convex set S; and O(S) is obtained by ‘expanding’ 
I(S) by a dilation using one r-square as structuring element. The MLP algorithm 
calculates the shortest path in the area $O(S) \setminus I(S)$ circumscribing the boundary 
of I(S). The length of such a shortest path is used as MLP curve length estimator. 
The MLP algorithm also takes linear time. 
Fig. 6. Test sets drawn in unit size. 

In the experiments we computed the errors of three approximation schemes 
for the specified objects digitized in grid resolutions r = 32 to 1024. For DSS 
and AS-MLP, C(S) was used as a digitized region, where C(S) is a set of pixels 
whose midpoints are included in S. For MLP, I(S) and its expansion were used. 



5.2 Experiments 



Following the given implementations of DSS and MLP, our new AS-MLP 
scheme has also been implemented in C++ for comparisons. We have computed the 
curve length error in percent compared to the true perimeter of a given set S. 
The error $E_{DSS}$ of the DSS estimation scheme is defined by 

$E_{DSS} = \frac{P(S) - P(DSS_S)}{P(S)}$ 

where P(S) is the true perimeter of S and $P(DSS_S)$ is the perimeter of the 
approximation polygon given by the DSS scheme. $E_{MLP}$ and $E_{ASMLP}$ are 
analogously defined. 
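
As a small illustration of this error measure, the following Python sketch computes the relative error in percent. The function name and the sample estimator value are hypothetical, and the factor of 100 is an assumption about how the percentage is formed; only the ratio itself comes from the definition above.

```python
import math

def relative_error_percent(true_perimeter, estimated_perimeter):
    """Relative length error (P(S) - P(estimate)) / P(S), expressed in percent."""
    return 100.0 * (true_perimeter - estimated_perimeter) / true_perimeter

# CIRCLE test object: radius 0.4, so the true perimeter is 2 * pi * 0.4.
true_p = 2 * math.pi * 0.4
estimate = 2.5  # hypothetical estimator output, for illustration only
print(relative_error_percent(true_p, estimate))
```

With this sign convention a positive value means that the estimator underestimates the true perimeter.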

Figure Q shows the errors for all five test curves, the boundaries of CIRCLE, 
YINYANG, LUNULE, SINC, and SQUARE in that order, from top to bottom. 
The diagrams for DSS, MLP, and AS-MLP are arranged from left to right, in 
each row of the figure. The graphs illustrate that AS-MLP has smaller errors in 
general than MLP has, but DSS is the best among the three. 



6 Conclusion 

We proposed a new approximation scheme for planar digital curves and analyzed 
its convergence to the true curve length by stating a theorem for convex sets. To 
verify its practical performance we have implemented this scheme and tested it on 
various curves including non-convex ones. The results reflected the theoretical 
analysis of the three schemes, that is, DSS is the best in accuracy, and our 
scheme is in the middle. The AS-MLP approximation curves are smoother (see 
our definition above) than the MLP or DSS curves. 



Acknowledgment. The C++ programs used for DSS and MLP are those 
implemented for and described in [?], and the test data set has been copied from [?]. 
Abstract. Digital distance transforms are useful tools for many image 
analysis tasks. In the 2D case, the maximum difference from Euclidean 
distance is considerably smaller when using a 5 × 5 neighbourhood 
compared to using a 3 × 3 neighbourhood. In the 3D case, weighted distance 
transforms for neighbourhoods larger than 3 × 3 × 3 have hardly been 
considered so far. We present optimal local distances for an extended 
neighbourhood in 3D, where we use the three weights in the 3 × 3 × 3 
neighbourhood together with the (2, 1, 1) weight from the 5 × 5 × 5 
neighbourhood. A good integer approximation is shown to be (3, 4, 5, 7). 



1 Introduction 

Digital distance transforms have been used for computing distances in images 
since the 60s [?]. The basic idea is to approximate the Euclidean distance in a 
computationally convenient way by propagating local distances. The 
reason the Euclidean distance is approximated is that it is rotation independent 
(up to digitization effects). The distance transform can be computed during two 
scans, one forward and one backward, over the image, considering only a small 
neighbourhood around each pixel/voxel. 
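
The two-scan idea can be sketched in Python for the simpler 2D case. This is a minimal illustration, not the 3D algorithm studied in this paper: the function name, the list-of-rows image layout, and the (3, 4) weights for edge and diagonal steps are assumptions chosen for the example.

```python
def chamfer_2d(image, a=3, b=4):
    """Two-scan weighted distance transform of a 2D binary image.

    image: list of rows; 0 marks background, any other value marks object.
    a, b: local distances for edge and diagonal steps.
    """
    inf = float("inf")
    h, w = len(image), len(image[0])
    dt = [[0 if image[y][x] == 0 else inf for x in range(w)] for y in range(h)]

    # Each mask holds the neighbours already visited in the respective scan.
    forward_mask = [(-1, -1, b), (-1, 0, a), (-1, 1, b), (0, -1, a)]
    backward_mask = [(1, 1, b), (1, 0, a), (1, -1, b), (0, 1, a)]

    def relax(y, x, mask):
        for dy, dx, weight in mask:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                dt[y][x] = min(dt[y][x], dt[ny][nx] + weight)

    for y in range(h):                      # forward scan: top-left to bottom-right
        for x in range(w):
            relax(y, x, forward_mask)
    for y in range(h - 1, -1, -1):          # backward scan: bottom-right to top-left
        for x in range(w - 1, -1, -1):
            relax(y, x, backward_mask)
    return dt

# A 5 x 5 object with a single background pixel in the centre.
image = [[1] * 5 for _ in range(5)]
image[2][2] = 0
dt = chamfer_2d(image)
```

Each pixel is visited exactly twice, so the running time is linear in the number of pixels.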

Compared to the earliest approach, which was based on the computation of 
the number of steps in a minimal path, the Euclidean distance can be better 
approximated by using different weights for steps representing different 
neighbour relations. For 2D images, optimal local distances in neighbourhoods up to 
13 × 13 have been investigated [?]. The maximum difference to the Euclidean 
distance was considerably smaller when using a 5 × 5 neighbourhood compared 
to a 3 × 3 neighbourhood, 1.41% compared to 4.49% for optimal weights. 

In a distance transform certain regularity criteria for the weights need to 
be satisfied. In fact, only rather limited intervals for the different weights are 
allowed. In [?], necessary conditions for an nD distance transform to be a metric 
are presented. In 3D, optimization of local distances in a 3 × 3 × 3 neighbourhood 
fulfilling these conditions is presented in [?]. It was shown that there are two 
different valid cases, for which the distance functions, and hence the optimizations, 
are different. 
In this paper, we will compute the optimal local distances for an extended 
neighbourhood in 3D, where we use the three weights in the 3 × 3 × 3 
neighbourhood together with the (2, 1, 1) weight from the 5 × 5 × 5 neighbourhood. As for 
the 3 × 3 × 3 neighbourhood, different cases occur, more precisely eight different 
cases. We will illustrate all cases and optimize for the most straightforward of 
them. 

An early attempt to find optimal local distances in a 5 × 5 × 5 neighbourhood 
can be found in [?]. However, nothing is mentioned about the validity of the 
resulting distance transforms with respect to the above mentioned regularity 
criteria. Weights for 3D distance transforms in extended neighbourhoods have 
recently also been treated in [?]. There, regularity and optimization criteria 
different from ours are used, and only integer weights are considered. 

2 Geometry and Equations 

Consider a 3D bi-level image, consisting of object and background voxels. In the 
corresponding distance image, or distance transform (DT), each voxel in the 
object can be labelled with the distance to the closest voxel in the background. 
A good underlying concept for all digital distances, introduced in [?], is: 

Definition 1. The distance between two points x and y is the length of the 
shortest path connecting x to y in an appropriate graph. 

Each voxel has three types of neighbours in its 3 × 3 × 3 neighbourhood: 6 
face neighbours, 12 edge neighbours, and 8 point neighbours. A path between 
two voxels in a 3D image can thus include steps in 26 directions, if only steps 
between immediate neighbours are allowed. In a 5 × 5 × 5 neighbourhood, each 
voxel has 124 neighbours, belonging to one of six different types, i.e., the types 
in the 3 × 3 × 3 neighbourhood plus three new types d, e, and f. See Fig. 1 for 
the various possible steps in a 5 × 5 × 5 neighbourhood (each of course occurs in 
several rotations). These six possible steps are usually called prime steps or local 
steps. The corresponding distances are called local distances. 
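
The neighbour counts can be reproduced by enumerating all offsets. The Python sketch below is an illustration only; the function name and the rule used to separate prime steps from multiples of shorter steps (a greatest common divisor of 1) are assumptions of this example. It shows that 98 of the 124 offsets belong to the six prime step types, while the remaining 26 are doubled a, b, and c steps.

```python
from itertools import product
from math import gcd
from collections import Counter

def classify_neighbours(radius=2):
    """Group the offsets of a (2*radius+1)^3 neighbourhood by step type.

    An offset is treated as a prime (local) step if its components have no
    common divisor greater than 1; otherwise it is a multiple of a shorter step.
    """
    prime, multiples = Counter(), Counter()
    for offset in product(range(-radius, radius + 1), repeat=3):
        if offset == (0, 0, 0):
            continue
        # Canonical type: absolute components sorted in decreasing order.
        step_type = tuple(sorted((abs(v) for v in offset), reverse=True))
        g = gcd(gcd(step_type[0], step_type[1]), step_type[2])
        if g == 1:
            prime[step_type] += 1
        else:
            multiples[step_type] += 1
    return prime, multiples

prime, multiples = classify_neighbours()
```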




Step a     Step b     Step c     Step d     Step e     Step f 
(1, 0, 0)  (1, 1, 0)  (1, 1, 1)  (2, 1, 0)  (2, 1, 1)  (2, 2, 1) 

Fig. 1. Local distances in a 5 × 5 × 5 neighbourhood of a voxel (grey). 



Optimizing for all six local distances is complicated and may not be necessary 
to discover new useful weighted distance transforms that are considerably better 
than the ones using only a 3 × 3 × 3 neighbourhood. Here, we add only one of 
the three possible steps in the 5 × 5 × 5 neighbourhood, the e step. Adding this 
step can be assumed to make the most difference to the 3 × 3 × 3 weighted DT, 
as its Euclidean length, $\sqrt{6}$, provides the largest decrease compared to using two 
a, b, c steps, $a + c = 1 + \sqrt{3}$ (cf. d = a + b, f = b + c). Expressed in another 
way: the “corner” cut off by this new step is the sharpest of the three d, e, or f 
corners. The new distance is denoted (a, b, c, e). 
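
The claim that the e step gives the largest decrease can be checked with Euclidean step lengths. The following Python sketch compares, for each of the three new steps, the direct Euclidean length with the length of the two-step path named in the text; the variable names are assumptions of this illustration.

```python
import math

# Euclidean lengths of the 3x3x3 steps and of the three candidate 5x5x5 steps.
a, b, c = 1.0, math.sqrt(2), math.sqrt(3)
candidates = {
    "d": (math.sqrt(5), a + b),   # (2, 1, 0) = (1, 0, 0) + (1, 1, 0)
    "e": (math.sqrt(6), a + c),   # (2, 1, 1) = (1, 0, 0) + (1, 1, 1)
    "f": (3.0, b + c),            # (2, 2, 1) = (1, 1, 0) + (1, 1, 1)
}
# Decrease in path length obtained by taking the direct step.
decrease = {name: two_step - direct for name, (direct, two_step) in candidates.items()}
best = max(decrease, key=decrease.get)
```

The decreases are roughly 0.18 for d, 0.28 for e, and 0.15 for f, so e is the step that shortens the path the most.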

Let DT(i, j, k) for an object voxel v(i, j, k) be the minimal length of a 
path connecting v to any voxel w in the background using only the local 
steps (a, b, c, e). Due to symmetry, when computing optimal local distances, 
it is enough to consider distances from the origin to a voxel (x, y, z), where 
$0 \le z \le y \le x \le M$ and M is the maximum dimension of the image. 

The length of a minimal path is independent of the order in which the steps 
are taken. We can, therefore, assume that the minimal path consists of a number 
of straight line segments, equal to the number of different directions used. 

Not all combinations of local distances a, b, c, and e are useful. The following 
property is desirable: 

Definition 2. Consider two voxels that can be connected by a straight line, i.e., 
by using one type and direction of local step. If that line defines a distance between 
the voxels, i.e., is a minimal path, then the resulting DT is semi-regular. If there 
are no other minimal paths, then the DT is regular. 

In [?], necessary conditions for weighted distance transforms in dimensions ≥ 2 
to be a metric, i.e., for the corresponding distance function to be positive definite, 
symmetric and to fulfil the triangle inequality, are presented. The result is summarized 
by Theorem 1. 

Theorem 1. A distance transform in $\mathbb{Z}^n$ that is a metric is semi-regular. A 
semi-regular distance transform in $\mathbb{Z}^2$ is a metric. 

Theorem 1 implies for the 2D case that a weighted DT is a metric if and only if 
it is semi-regular, and for the 3D case that a metric DT is semi-regular. Thus, 
3D DTs should be semi-regular, even though it is not a sufficient condition for 
being metric. 

For the (a, b, c, e) distance, we can have four types of straight paths (Def. 2). We need 
to investigate these straight lines and find the ways they can be approximated 
using other steps, to find conditions for the straight path to be the shortest. By 
such investigation (see [?] for the procedure) we find that the semi-regularity 
criteria are given by the inequalities in (1), which define a hyper-volume in 
a, b, c, e parameter space. 

$e \le a + c, \quad e \le 2b, \quad b \le c, \quad 2e \ge 3b, \quad 4c \le 3e \qquad (1)$ 

The regularity criteria are not sufficient to determine unique expressions for the 
weighted DTs. We have found that the hyper-volume (1) is divided into eight 
subspaces, giving rise to eight different cases of semi-regular DTs. (For the 3 × 3 × 3 
neighbourhood, there are two different cases [?].) The eight cases occur in the 
following hyper-volumes: 
Case I:     $e \ge a + b$,  $e \ge \frac{b}{2} + c$ 
Case II:    $e \le a + b$,  $e \ge \frac{b}{2} + c$,  $e \ge a + 2b - c$ 
Case III:   $e \ge \frac{b}{2} + c$,  $e \le a + 2b - c$ 
Case IV:    $e \le a + b$,  $e \le \frac{b}{2} + c$,  $e \ge a + 2b - c$,  $e \ge -a + b + c$ 
Case V:     $e \ge a + b$,  $e \le \frac{b}{2} + c$,  $e \ge -a + b + c$ 
Case VI:    $e \ge -a + b + c$,  $e \le \frac{b}{2} + c$,  $e \le a + 2b - c$ 
Case VII:   $e \ge a + b$,  $e \le -a + b + c$ 
Case VIII:  $e \le a + b$,  $e \le -a + b + c$ 



The differences between two of the possibilities, Case I and Case VIII, are 
illustrated by path maps for all voxels $0 \le z \le y \le x \le 4$ in Fig. 2. From 
the path maps we can see that, e.g., the minimal path from (0, 0, 0) to (2, 2, 1) 
consists of a combination of one b- and one c-step in Case I and of one a- and 
one e-step in Case VIII. Examples of (a, b, c, e)-balls computed for Case I-VIII 
with strict inequalities are shown in Fig. 3. They are all polyhedra, but with 
different configurations of vertices and faces. 

In the following, we will focus on the first of the eight cases. This case can 
perhaps be said to be the “natural” one, in the sense that it is the case where 
$(1, \sqrt{2}, \sqrt{3}, \sqrt{6})$ is placed, i.e., where the Euclidean distances are used as local 
distances, and where the equations are the “expected” ones. The parameter 
space is limited by the inequalities (2) (from (1) and Case I). 

$e \le a + c, \quad e \ge \frac{b}{2} + c, \quad e \ge a + b, \quad e \le 2b, \quad b \le 2a \qquad (2)$ 
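
The two sets of inequalities translate directly into membership tests. The Python sketch below is an illustration under two assumptions: the inequalities are read as non-strict (the printed relation signs are ambiguous), and the second Case I condition is read as e ≥ b/2 + c. The function names and the tolerance parameter are hypothetical.

```python
import math

def is_semi_regular(a, b, c, e, tol=1e-9):
    """Check the semi-regularity inequalities (1), read as non-strict."""
    return (e <= a + c + tol and e <= 2 * b + tol and b <= c + tol
            and 2 * e >= 3 * b - tol and 4 * c <= 3 * e + tol)

def is_case_one(a, b, c, e, tol=1e-9):
    """Check the Case I inequalities (2), read as non-strict."""
    return (e <= a + c + tol and e >= b / 2 + c - tol and e >= a + b - tol
            and e <= 2 * b + tol and b <= 2 * a + tol)

# Euclidean local distances and the integer weights (3, 4, 5, 7).
euclidean = (1.0, math.sqrt(2), math.sqrt(3), math.sqrt(6))
integer_weights = (3, 4, 5, 7)
```

Both the Euclidean local distances and the integer weights (3, 4, 5, 7) pass both tests; the integer weights lie exactly on the boundaries e = a + b and e = b/2 + c.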



The weighted distance between the origin and (x, y, z) in Case I becomes 

$D(x, y, z) = \begin{cases} z(c - b) + y(c + b - e) + x(e - c), & \text{if } x \le y + z, \\ z(e - b - a) + y(b - a) + ax, & \text{if } x \ge y + z. \end{cases}$ 
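
The Case I distance can be evaluated directly from this expression. The following Python sketch is a minimal illustration; the function name is an assumption, and the sorting of absolute coordinates implements the symmetry argument $0 \le z \le y \le x$ used in the text.

```python
def case_one_distance(x, y, z, a, b, c, e):
    """Weighted (a, b, c, e) distance from the origin to (x, y, z) in Case I."""
    # By symmetry, use absolute values sorted so that 0 <= z <= y <= x.
    x, y, z = sorted((abs(x), abs(y), abs(z)), reverse=True)
    if x <= y + z:
        return z * (c - b) + y * (c + b - e) + x * (e - c)
    return z * (e - b - a) + y * (b - a) + a * x

weights = (3, 4, 5, 7)
# The four local steps reproduce their own weights.
local_steps = [case_one_distance(*p, *weights) for p in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 1, 1)]]
```

For the weights (3, 4, 5, 7), the voxel (2, 2, 1) gets the value 9, which is one b-step plus one c-step, in agreement with the Case I path map described in the text.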



3 Optimality Computations, Case I 

Many different optimality criteria have been used in the literature, see, e.g., [?]. 
The results from optimizations using different criteria are often very similar, 
see discussion in [?]. Optimality is here defined as minimizing the maximum 
difference between the weighted distance and the Euclidean distance in a cubic 
image of size M × M × M. 

For the optimization, the same Lemma as was used for the 3 × 3 × 3 
neighbourhood [?] will be very useful. 
Fig. 2. Path maps for all voxels $0 \le z \le y \le x \le 4$ (Case I, top, and Case VIII, 
bottom). 







118 G. Borgefors and S. Svensson 





Fig. 3. (a, b, c)-balls fulfilling Case I and II for a 3 × 3 × 3 neighbourhood, and 
(a, b, c, e)-balls fulfilling Case I-VIII, see text. 



The difference between the computed distance (the Case I expression above) and 
the Euclidean distance is 

$\mathrm{Diff}_{\kappa_2} = (c - b)z + (c + b - e)y + (e - c)x - \sqrt{x^2 + y^2 + z^2}$, if $x \le y + z$, 
$\mathrm{Diff}_{\mu_2} = (e - b - a)z + (b - a)y + ax - \sqrt{x^2 + y^2 + z^2}$, if $x \ge y + z$. 

This difference is to be minimized for an M × M × M image. If we put the 
origin in the corner of the image, we can assume that the maximum difference, 
maxdiff, occurs for x = M (remember $0 \le z \le y \le x$). We have $0 \le y \le M$ 
and $0 \le z \le y$. Due to the nature of the distance function, we must separate 
the cases $M \le y + z$ and $M \ge y + z$. The interesting areas in the y, z-plane can 
be seen in Fig. 4. The maximum difference can occur within the two triangular 
areas ($\mu_2$ and $\kappa_2$), on the boundaries ($\mu_1$, $\mu_3$, $\kappa_1$, $\kappa_3$, and $\lambda_2$), or at the corners 
($\mu_4$, $\kappa_4$, $\lambda_1$, and $\lambda_3$). Starting with the two areas, $\mu_2$ and $\kappa_2$, the extrema occur 
for $\frac{\partial}{\partial z}(\mathrm{Diff}) = 0$. By using the Lemma we have 

$\mathrm{Diff}^{max}_{\kappa_2} = (e - c)M + (c + b - e)y - \sqrt{M^2 + y^2}\,\sqrt{1 - (c - b)^2}$, if $M - y \le z \le y$, $\frac{M}{2} \le y \le M$, 
$\mathrm{Diff}^{max}_{\mu_2} = aM + (b - a)y - \sqrt{M^2 + y^2}\,\sqrt{1 - (e - b - a)^2}$, if $0 \le z \le M - y$, $0 \le y \le M$. 



Fig. 4. The different areas, lines, and points where the maximal differences from the 
Euclidean distance can occur in Case I. 



On the lines in Fig. 4 we have the following expressions for Diff: 

$\mathrm{Diff}_{\mu_1} = (b - a)y + aM - \sqrt{M^2 + y^2}$, if $0 \le y \le M$, 
$\mathrm{Diff}_{\mu_3} = (e - 2a)y + aM - \sqrt{M^2 + 2y^2}$, if $0 \le y \le \frac{M}{2}$, 
$\mathrm{Diff}_{\kappa_1} = (c - b)z + bM - \sqrt{2M^2 + z^2}$, if $0 \le z \le M$, 
$\mathrm{Diff}_{\kappa_3} = (2c - e)y + (e - c)M - \sqrt{M^2 + 2y^2}$, if $\frac{M}{2} \le y \le M$, 
$\mathrm{Diff}_{\lambda_2} = (2b - e)y + (e - b)M - \sqrt{M^2 + y^2 + (M - y)^2}$, if $\frac{M}{2} \le y \le M$. 

The maxima of all these expressions can be found by using the Lemma with $\delta = 0$. 
This, together with the values at the corner points, see Fig. 4, gives the following 
possibilities for the maximum difference in Case I: 




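$E_{\mu_1} = \left(a - \sqrt{1 - (b - a)^2}\right) M$, for $(M, y_{max}, 0)$, 
$E_{\mu_2} = \left(a - \sqrt{1 - (e - b - a)^2 - (b - a)^2}\right) M$, for $(M, y_{max}, z_{max})$, 
$E_{\mu_3} = \left(a - \sqrt{1 - \tfrac{1}{2}(e - 2a)^2}\right) M$, for $(M, y_{max}, y_{max})$, 
$E_{\mu_4} = (a - 1) M$, for $(M, 0, 0)$, 
$E_{\lambda_1} = (b - \sqrt{2}) M$, for $(M, M, 0)$, 
$E_{\lambda_2} = \tfrac{1}{2}\left(e - \sqrt{3}\,\sqrt{2 - (2b - e)^2}\right) M$, for $(M, y_{max}, z_{max})$, 
$E_{\lambda_3} = \tfrac{1}{2}(e - \sqrt{6}) M$, for $(M, \tfrac{M}{2}, \tfrac{M}{2})$, 
$E_{\kappa_1} = \left(b - \sqrt{2}\,\sqrt{1 - (c - b)^2}\right) M$, for $(M, M, z_{max})$, 
$E_{\kappa_2} = \left(e - c - \sqrt{1 - (c - b)^2 - (c + b - e)^2}\right) M$, for $(M, y_{max}, z_{max})$, 
$E_{\kappa_3} = \left(e - c - \sqrt{1 - \tfrac{1}{2}(2c - e)^2}\right) M$, for $(M, y_{max}, y_{max})$, 
$E_{\kappa_4} = (c - \sqrt{3}) M$, for $(M, M, M)$. 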



Numerical experiments show that the optimum occurs for $E_{\mu_4} = E_{\lambda_1} = E_{\kappa_4}$. 
This implies that 

$b = \sqrt{2} + a - 1$ 
$c = \sqrt{3} + a - 1$ 

We use this to limit our problem to two variables, a and e. Numerical experiments 
also show that at the optimum $E_{\mu_4} = -E_{\mu_2}$ and that the optimum occurs on 
the boundary $e = \frac{b}{2} + c$ of the Case I hyper-volume. This leads to the solution 

$a_{opt} = \frac{1}{17}\left(7 + 2\sqrt{3} - \sqrt{2} + 2\sqrt{16\sqrt{6} + 24\sqrt{3} + 22\sqrt{2} - 99}\right) \approx 0.9545,$ 
$b_{opt} = \sqrt{2} + a_{opt} - 1 \approx 1.3687,$ 
$c_{opt} = \sqrt{3} + a_{opt} - 1 \approx 1.6865,$ 
$e_{opt} = \frac{b_{opt}}{2} + c_{opt} \approx 2.3709,$ 

with maxdiff $= (1 - a_{opt}) M \approx 0.0455M$. 
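
The optimal solution can be checked numerically. The Python sketch below assumes that the closed form for the optimal a reads (7 + 2√3 − √2 + 2√(16√6 + 24√3 + 22√2 − 99))/17, that b and c follow from b = √2 + a − 1 and c = √3 + a − 1, and that the boundary condition is e = b/2 + c; these readings are assumptions of this illustration. It also evaluates the two differences, at the corner (M, 0, 0) and inside the area μ2, that are balanced at the optimum. All variable names are hypothetical.

```python
import math

s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)

# Closed-form optimum for Case I (assumed reading of the printed formula).
a_opt = (7 + 2 * s3 - s2 + 2 * math.sqrt(16 * s6 + 24 * s3 + 22 * s2 - 99)) / 17
b_opt = s2 + a_opt - 1
c_opt = s3 + a_opt - 1
e_opt = b_opt / 2 + c_opt            # optimum lies on the boundary e = b/2 + c

# Differences per unit M at the corner (M, 0, 0) and inside the area mu_2.
e_mu4 = a_opt - 1
e_mu2 = a_opt - math.sqrt(1 - (e_opt - b_opt - a_opt) ** 2 - (b_opt - a_opt) ** 2)
maxdiff = 1 - a_opt
```

Under these assumptions the values agree with the rounded figures quoted in the text, and the two differences cancel up to floating-point precision.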

For the computation of good integer local distances, we also need to find the 
optimal solution for a = 1. This is due to the fact that for the integer solution a 
becomes a scale factor, i.e., in reality a = 1. Thus, we solve $E^*_{\lambda_1} = E^*_{\kappa_4} = -E^*_{\mu_1} = 
-E^*_{\mu_2}$, where the star denotes that a = 1 has been substituted in the expressions. 
The solution is given by 

$b^*_{opt} = \frac{1}{\sqrt{2}} + \sqrt{\sqrt{2} - 1} \approx 1.3507,$ 
$c^*_{opt} = \sqrt{3} - \sqrt{2} + \frac{1}{\sqrt{2}} + \sqrt{\sqrt{2} - 1} \approx 1.6685,$ 
$e^*_{opt} = 1 + \frac{1}{\sqrt{2}} + \sqrt{\sqrt{2} - 1} \approx 2.3507,$ 

with maxdiff* $= \left(\frac{1}{\sqrt{2}} - \sqrt{\sqrt{2} - 1}\right) M \approx 0.0635M$. 

Both solutions, $(a_{opt}, b_{opt}, c_{opt}, e_{opt})$ and $(1, b^*_{opt}, c^*_{opt}, e^*_{opt})$, fulfil the 
inequalities in (1), i.e., the regularity criteria. 
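
The solution for a = 1 can be verified in the same way. The Python sketch below assumes the closed forms b* = 1/√2 + √(√2 − 1), c* = √3 − √2 + b*, and e* = 1 + b*, which reproduce the rounded values quoted in the text; the variable names and the choice of which differences to evaluate are assumptions of this illustration.

```python
import math

s2, s3 = math.sqrt(2), math.sqrt(3)
root = math.sqrt(s2 - 1)

# Optimal Case I solution for a = 1 (starred values in the text).
b_star = 1 / s2 + root
c_star = s3 - s2 + 1 / s2 + root
e_star = 1 + 1 / s2 + root
maxdiff_star = 1 / s2 - root

# Differences per unit M that are balanced at this optimum.
e_lambda1 = b_star - s2                              # corner (M, M, 0)
e_kappa4 = c_star - s3                               # corner (M, M, M)
e_mu1 = 1 - math.sqrt(1 - (b_star - 1) ** 2)         # line z = 0
```

The two corner differences are equal and negative, and the difference on the line z = 0 has the same magnitude with opposite sign, which is the balance expected at the optimum.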

4 Integer Approximations, Case I 

Working with real-valued local distances is generally not desirable. To find good 
integer local distances, we use the optimal solution for a = 1. The candidates 
for integer approximations of the optimal values, denoted A, B, C, and E, are 
found by multiplying the real-valued solutions b, c, e by an integer scale factor A 
and rounding to the nearest integer. Note that it is necessary to check that the 
resulting distance transform, $(1, \frac{B}{A}, \frac{C}{A}, \frac{E}{A})$, is within the allowed hyper-volume for 
Case I! All eleven differences $E(1, \frac{B}{A}, \frac{C}{A}, \frac{E}{A})$ are computed, to find the maximum 
for the approximation. Good integer weighted DTs are listed in Table 1 together 
with the optimal solutions. The maximal differences for the integer DTs are 
converging towards maxdiff* as expected. 
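
The evaluation of an integer candidate can be sketched as follows. The eleven expressions in this Python example are a reconstruction of the Case I differences from Section 3, so they should be read as an assumption of this illustration rather than as the authors' code; the function names are hypothetical. The weights are scaled so that a = 1, as described above.

```python
import math

def case_one_differences(a, b, c, e):
    """The eleven candidate differences (per unit M) for Case I."""
    sq = math.sqrt
    return {
        "mu1": a - sq(1 - (b - a) ** 2),
        "mu2": a - sq(1 - (e - b - a) ** 2 - (b - a) ** 2),
        "mu3": a - sq(1 - (e - 2 * a) ** 2 / 2),
        "mu4": a - 1,
        "lambda1": b - sq(2),
        "lambda2": (e - sq(3) * sq(2 - (2 * b - e) ** 2)) / 2,
        "lambda3": (e - sq(6)) / 2,
        "kappa1": b - sq(2) * sq(1 - (c - b) ** 2),
        "kappa2": e - c - sq(1 - (c - b) ** 2 - (c + b - e) ** 2),
        "kappa3": e - c - sq(1 - (2 * c - e) ** 2 / 2),
        "kappa4": c - sq(3),
    }

def maxdiff_integer(A, B, C, E):
    """Maximum absolute difference (per unit M) for integer weights (A, B, C, E)."""
    diffs = case_one_differences(1.0, B / A, C / A, E / A)
    return max(abs(v) for v in diffs.values())

for weights in [(3, 4, 5, 7), (11, 15, 19, 27), (14, 19, 24, 34),
                (17, 23, 29, 41), (20, 27, 34, 48)]:
    print(weights, round(maxdiff_integer(*weights), 4))
```

For the five integer weight sets of Case I, this sketch gives maximal differences that agree with the values listed in Table 1.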



Table 1. Maximal differences to the Euclidean distance for different (a, b, c, e) weighted 
distance transforms in an M × M × M image. 

Case     a       b        c        e        maxdiff 
3×3×3    a_opt   b_opt    c_opt    ∞        0.0736M 
3×3×3    1       b*_opt   c*_opt   ∞        0.1024M 
3×3×3    3       4        5        ∞        0.1181M 
I        a_opt   b_opt    c_opt    e_opt    0.0455M 
I        1       b*_opt   c*_opt   e*_opt   0.0635M 
I        3       4        5        7        0.0809M 
I        11      15       19       27       0.0729M 
I        14      19       24       34       0.0687M 
I        17      23       29       41       0.0662M 
I        20      27       34       48       0.0646M 



The ball for (3, 4, 5, 7) can be seen in Fig. 5. This ball represents the sensible 
choice of a weighted integer distance transform in the extended neighbourhood. 




Fig. 5. A (3, 4, 5, 7) ball with radius 83 voxels. 
5 Discussion 

We have presented results when optimizing local distances for an extended neigh- 
bourhood in 3D, where we use the three weights in the 3x3x3 neighbourhood 
together with the (2, 1, 1) weight from the 5x5x5 neighbourhood. For the op- 
timal, real-valued, solution, the maximum difference compared to the Euclidean 
distance is 4.55% of the maximum distance occurring in the image. This can be 
compared to 7.37% when a 3 x 3 x 3 neighbourhood is used. In the 3x3x3 case 
the best integer approximation we can have has difference 10.24%, and the sen- 
sible (3,4,5) has 11.81%. Compare this with the best integer approximation in 
the 5x5x5 case, 6.35%, and the sensible choice (3, 4, 5, 7) with 8.09% difference. 

For the other cases of semi-regular (a, b, c, e) DTs, Cases II-VIII, we have 
seen that the expressions for the distances D(x, y, z) become quite complicated. 
This is due to the fact that D(x, y, z) often depends on whether an even or 
an odd number of steps has been taken. 

We will continue our investigation of finding optimal local distances for a 
5 × 5 × 5 neighbourhood by also adding the steps d and f in the minimal paths. 
Abstract. A Point Distribution Model requires first the choice of an 
appropriate representation for the data and then the estimation of the 
density within this representation. Independent Component Analysis is 
a linear transform that represents the data in a space where statistical 
dependencies between the components are minimized. In this paper, we 
propose Independent Component Analysis as a representation for point 
distributions. We observe that within this representation, the density 
estimation is greatly simplified and propose solutions to the most 
common problems concerning shapes, mainly testing shape feasibility and 
finding the nearest feasible shape. We also observe how the description 
of shape deformations in terms of statistically independent modes 
provides a more intuitive and manageable framework. We perform 
experiments to illustrate the results and compare them with existing 
approaches. 

Keywords: Point distribution model; Shape Representation; Shape De- 
scription; Independent Component Analysis; Independent Modes of Vari- 
ation. 



1 Introduction 

The Point Distribution Model (PDM) [?] is a shape description technique based 
on the vectorized representation of shapes to estimate a statistical model for 
non-rigid shape variations. This model can be used for generating or testing 
new examples. The statistical modeling of shape variation, and its combination 
with several image processing techniques, has generated an important number of 
applications in recent years. These applications include tracking, recognition, 
biomedical imaging, special effects for film and television, and registration, among 
others [?]. 

The construction of an appropriate PDM for a certain type of shape we 
wish to learn requires both the selection of a good representation and of an 
appropriate density estimation method for the distribution of the shapes within 
this representation. For the representation we can use linear or nonlinear 
models. Though several nonlinear models have been proposed [?], a linear 
representation is still a common choice for its speed and straightforward 
interpretability. As a matter of fact, most of the nonlinear representations are 
applied over a linear representation which previously performs the 
dimensionality reduction. On the other hand, even when the training set generates complex 
distributions, a linear representation can be used and complexity charged to the 
statistical model [?]. The most successful linear representation so far, for its 
simplicity and straightforward interpretation, is the one obtained through Principal 
Component Analysis (PCA). By projecting a shape in a previously learnt PCA 
space, we have a set of coefficients or parameters (the principal components) 
which control the variation along maximum variance directions. This is why the 
parameters of the distribution model in the PCA space are known as (principal) 
modes of variation. 

In this paper we propose an alternative linear representation of the PDMs
using Independent Component Analysis (ICA). This linear transform represents
our data in a space in which the statistical dependence between the components
is minimized. This can be particularly interesting for those non-rigid shapes
whose modes of variation are assumed to be statistically independent. In PCA,
by construction, the modes of variation are given by uncorrelated projections
of maximum variance. The assumption that these projections are optimal for
modeling shape variation is not necessarily correct. In certain cases (the
fingers of the hand are a good example), higher order relationships such as
independence can be important for better modeling. The most important advantage
of independence is that it simplifies density estimation by transforming an
N-dimensional density estimation problem into N one-dimensional estimations.
This provides simplicity and a robust framework. The direct relationship
between the independent components and the shape deformations also allows
robust tagging (classification) and tracking. Both of these features are of
great importance in specific applications such as Active Shape Models.

Before presenting the main results we introduce the basics of both Independent
Component Analysis and Point Distribution Models. Especially in the first case,
the theory presented here is incomplete for the interested reader. For an
extended and up-to-date exposition of definitions and results involving ICA we
recommend Aapo Hyvärinen's survey.

2 Point Distribution Models 

If we use n points to describe a certain shape in d dimensions, we can represent
this shape by an N = nd dimensional vector obtained by concatenating the point
position values. Given K samples of a certain shape, we choose n locations as
key points or "landmark points", and obtain K vectors, one for each shape of the
training set. In order to be able to compare these points, a certain alignment
in an approximate sense is necessary. The Procrustes method or modifications of
it are frequently used in this stage. The selection of a correct criterion for
alignment should not be underestimated since these operations will greatly
affect the final distribution by introducing or avoiding nonlinearities. For
the rest of this paper we will assume that our aligned training set is a sample
of the random vector x.
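
The vectorized representation described above can be sketched in a few lines. This is only a minimal illustration: the function names and the toy triangles are hypothetical, and the alignment step is reduced to centering (removing translation), whereas a full Procrustes alignment would also remove scale and rotation.

```python
def center_shape(landmarks):
    """Translate a shape so that its centroid sits at the origin."""
    n = len(landmarks)
    d = len(landmarks[0])
    centroid = [sum(p[k] for p in landmarks) / n for k in range(d)]
    return [tuple(p[k] - centroid[k] for k in range(d)) for p in landmarks]


def shape_to_vector(landmarks):
    """Concatenate n landmarks of dimension d into one N = n*d vector."""
    return [coord for point in landmarks for coord in point]


# Hypothetical training set: K = 2 shapes, n = 3 landmarks, d = 2.
# The second triangle is a translated copy of the first one.
training_shapes = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)],
]
vectors = [shape_to_vector(center_shape(s)) for s in training_shapes]
```

After centering, both toy shapes map to the same N = 6 dimensional vector, which is the kind of invariance the alignment stage is meant to provide.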

The next step is to find a proper representation for x. In the choice of the
representation, simplicity, dimensionality reduction, statistical properties and
interpretability should be considered.

2.1 The PCA Representation 

From the training set, we can estimate both the mean of the data $\bar{x}$ and its
covariance $C$. Intuitively, the covariance matrix tells us how each landmark
tends to move as the others move. Let $\phi_i$ be the eigenvectors of $C$ and
$\lambda_i$ their corresponding eigenvalues in decreasing order. If $\Phi$ is the
matrix built by placing the first $M$ eigenvectors as columns, a set of
parameters for the shape can then be defined by

$$b = \Phi^T (x - \bar{x}) \qquad (1)$$

This is the PCA or Karhunen-Loeve transform and the parameters $b$ represent
the projection of the shapes onto the subspace spanned by the eigenvectors of the
covariance matrix. It can be seen that these projections are uncorrelated and
that this subspace is the best linear subspace of dimension $M$ fit to the data.
Thus PCA, besides decorrelating, also provides a way of reducing dimensionality.
By projecting a shape onto the PCA space, we obtain a set of coefficients or
parameters (the principal components) which control the variation along these
maximum variance directions. So we can naturally associate each principal
component $b_i$ with a mode of variation of the shape. The choice of an
appropriate value for $M$ can be made in several ways; the most frequent is
based on the proportion of variance we wish to capture in the subspace.
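
As an illustration of the PCA representation, the following sketch estimates the mean and covariance, keeps the first M eigenvectors according to a variance proportion, and applies equation (1). It assumes NumPy is available; the function names, the 95% default and the synthetic data are assumptions made for the example, not part of the original method description.

```python
import numpy as np


def fit_pca(X, variance_kept=0.95):
    """Return the mean, the eigenvector matrix Phi (N x M) and the eigenvalues.

    X is a (K, N) array holding one aligned shape vector per row.
    """
    mean = X.mean(axis=0)
    C = np.cov(X, rowvar=False)                 # N x N sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to decreasing
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance ratio
    # Smallest M whose cumulative ratio reaches the requested proportion.
    M = min(int(np.searchsorted(ratio, variance_kept)) + 1, len(eigvals))
    return mean, eigvecs[:, :M], eigvals[:M]


def project(x, mean, Phi):
    """Equation (1): b = Phi^T (x - mean)."""
    return Phi.T @ (x - mean)


def reconstruct(b, mean, Phi):
    """Approximate shape recovered from the parameters b."""
    return mean + Phi @ b


# Hypothetical data: 200 "shapes" with N = 4 that vary along one direction.
rng = np.random.default_rng(0)
direction = np.array([0.6, 0.8, 0.0, 0.0])
X = 10.0 * rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 4))
mean, Phi, lam = fit_pca(X)
b = project(X[0], mean, Phi)
```

On this synthetic set a single mode is retained, and it coincides (up to sign) with the direction used to generate the data.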

2.2 Statistical Density Models for the PCA Representation 

If we have estimated, from the training set, the distribution of the parameters
$b \sim p(b)$, a reasonable way to decide on the feasibility of a shape with
parameters $b$ is

$$p(b) > p_t \qquad (2)$$

where $p_t$ is a certain threshold we consider appropriate. Usually, the threshold
is chosen so that some proportion of the training set passes the threshold. If the
parameters $b$ are assumed Gaussian and independent, we have that

$$\log p(b) = -\frac{1}{2} \sum_{i=1}^{M} \frac{b_i^2}{\lambda_i} + k \qquad (3)$$

where $k$ is a constant. In this case, the threshold represents
a likelihood which constrains feasible shapes to a hyperellipsoid. The size of
the hyperellipsoid can be obtained considering that the sum of the squares of
standardized Gaussian variables has a chi-squared distribution. Using this fact
and given a certain probability, we can obtain the desired threshold $p_t$. Given
a shape, if its likelihood is lower than our threshold, the nearest feasible
shape is the shape at the intersection of the hyperellipsoid and the line
passing through our current shape and the origin.
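
The hyperellipsoid test can be sketched as follows, under the stated assumption of independent Gaussian parameters with variances given by the eigenvalues. The function names are hypothetical. The chi-squared quantile is only given in closed form for two degrees of freedom; for a general M the bound `d2_max` would have to come from a chi-squared table or a statistics library.

```python
import math


def mahalanobis_sq(b, eigenvalues):
    """Sum of b_i^2 / lambda_i, the quantity that appears in equation (3)."""
    return sum(bi * bi / lam for bi, lam in zip(b, eigenvalues))


def chi2_quantile_2dof(prob):
    """Chi-squared quantile; this closed form is valid only for 2 d.o.f."""
    return -2.0 * math.log(1.0 - prob)


def is_feasible(b, eigenvalues, d2_max):
    """A shape is feasible when it lies inside the hyperellipsoid."""
    return mahalanobis_sq(b, eigenvalues) <= d2_max


def nearest_on_ellipsoid(b, eigenvalues, d2_max):
    """Scale b towards the origin until it reaches the hyperellipsoid."""
    d2 = mahalanobis_sq(b, eigenvalues)
    if d2 <= d2_max:
        return list(b)
    scale = math.sqrt(d2_max / d2)
    return [bi * scale for bi in b]


# Hypothetical two-mode model with variances 4 and 1, 95% of the probability.
eigenvalues = [4.0, 1.0]
d2_max = chi2_quantile_2dof(0.95)
nearest = nearest_on_ellipsoid([6.0, 0.0], eigenvalues, d2_max)
```

Scaling along the line through the origin keeps the relative proportions of the modes, which is why the corrected shape lands exactly on the boundary of the hyperellipsoid.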

Another approach is to choose hard limits on each direction. This is
related to the idea of statistical independence of the components of the
parameters. It is equivalent to constraining feasible shapes to a hypercube. A good
heuristic value for the limit on each direction is 3 times the standard
deviation in that direction. If we assume a Gaussian distribution in each
direction, this choice of limit values means that a shape is plausible if each
parameter belongs to the symmetrical mean-centered interval which has a marginal
probability of 0.997. In this case, for each $i = 1, \ldots, M$, the feasibility
of a shape is checked by $|b_i| < 3\sqrt{\lambda_i}$ and the nearest feasible
shape is obtained by

$$b_i^f = \mathrm{sign}(b_i) \cdot \min(3\sqrt{\lambda_i}, |b_i|) \qquad (4)$$
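
A minimal sketch of the hard-limit rule of equation (4) follows. The function names and the `n_std` parameter are introduced for illustration only; the check uses a non-strict inequality so that a clamped shape is itself accepted.

```python
import math


def within_limits(b, eigenvalues, n_std=3.0):
    """Check |b_i| <= n_std * sqrt(lambda_i) for every mode."""
    return all(abs(bi) <= n_std * math.sqrt(lam)
               for bi, lam in zip(b, eigenvalues))


def clamp_to_limits(b, eigenvalues, n_std=3.0):
    """Equation (4): keep the sign of b_i and cap its magnitude."""
    clamped = []
    for bi, lam in zip(b, eigenvalues):
        limit = n_std * math.sqrt(lam)
        clamped.append(math.copysign(min(limit, abs(bi)), bi))
    return clamped


# Hypothetical parameters for a three-mode model.
eigenvalues = [4.0, 1.0, 9.0]
b = [10.0, -1.0, -20.0]
b_feasible = clamp_to_limits(b, eigenvalues)
```

Each mode is treated separately, which is exactly the independence assumption that the hypercube constraint relies on.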

If a simple Gaussian estimation is not enough we can use more complex models.
A useful approach is to model p(b) using Gaussian Mixture Models (GMM).
The parameters of the GMM can be estimated with a parameter estimation
algorithm such as Expectation Maximization (EM). In this case, deciding the
plausibility of a shape is a more complex problem. Even though more precise
solutions can be developed, a simple one consists in deciding that a shape is
plausible if its likelihood is above the likelihood of a certain percentage of
shapes in the training set. The percentage value is generally above 80%. When
using a GMM, a general solution for the problem of finding the nearest feasible
shape is not available, so Monte Carlo and gradient descent methods are employed.
Moreover, estimating the GMM parameters in high dimensions is a highly unstable
problem.

3 An ICA Representation for the PDM 

The ICA of an N-dimensional random vector is the linear transform which
minimizes the statistical dependence between its components. This representation
in terms of independence proves useful in a large number of applications such
as data analysis and compression, blind source separation, blind deconvolution,
denoising, etc.

3.1 The ICA Model 

The noise-free ICA Model can be expressed as

$$x - \bar{x} = A s \qquad (5)$$

where $x$ corresponds to the random vector representing our data, $\bar{x}$ its mean,
$s$ the random vector of independent components with dimension $M \le N$, and
$A$ is called the mixture matrix. The pseudoinverse of $A$, which we will
represent as $W$, is called the filter or projection matrix, and provides an
alternative representation of the ICA Model,

$$W (x - \bar{x}) = s \qquad (6)$$

If the components of vector $s$ are independent, at most one is Gaussian and
their densities are not reduced to a point-like mass, it can be seen that $W$ is
completely determined, up to a permutation and scaling of the independent
components [7].

In practice, the independent components are unknown in advance so ICA
is performed by estimating W. For the estimation of the filter matrix several
objective functions such as likelihood, network entropy, mutual information and
approximations of these have been proposed. Though several algorithms have been
tested, the method employed in this article for the estimation of the
independent components is the one introduced by Hyvärinen and Oja, well known as
FastICA. This method performs a progressive minimization of mutual information
by finding maximum negentropy¹ directions, and proves fast and efficient.

Assuming we have learnt the mixing and filter matrices for the ICA Model in
(5) and (6), we will call s the independent modes of variation, and assume that
its components are statistically independent. The independent components have
zero mean and we can assume, without loss of generality, that they have unit
variance [7]. The choice of dimension is not as straightforward as in PCA but
there exist several approaches. In the shape domain it is reasonable to
attribute small variances to errors in the labelling process so, when required,
we will reduce dimensionality by first performing PCA and then ICA, treating
the discarded components as noise.



3.2 Statistical Density Models for the ICA Representation 

Due to the assumption of independence, we need only model the one-dimensional
densities corresponding to the M components of s. The complexity
of the method employed for density estimation is not relevant since we are
working with a single dimension and the calculations need only be performed while
training our PDM. Depending on the problem, we can use non-parametric methods [4]
such as histograms, kernel-based methods with Parzen Windows and Radial Basis
Functions, or semi-parametric methods such as Gaussian or Laplacian
Mixture Models.

¹ Negentropy is a non-gaussianity measure based on differential entropy. Mutual
information can be expressed in terms of negentropy.

Let s be a random variable which, within the PDM problem, would correspond
to one of the independent modes of variation. Because of the nature of
PDMs, we will assume that the density of s is likely to be one of a few
particular densities. Unimodal densities can be classified as subgaussian,
gaussian and supergaussian, according to their kurtosis (or fourth order
cumulant). Kurtosis is zero for gaussian densities, less than zero for
subgaussian densities such as the uniform distribution, and larger than zero for
supergaussian densities such as the Laplacian distribution, or the delta
distribution in the extreme case. A zero-centered supergaussian density is the
result of the component being mostly close to zero and only seldom significantly
non-zero.
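
The classification by kurtosis can be illustrated with a short sketch that estimates the excess kurtosis (fourth standardized moment minus 3) from samples. The tolerance `tol` and the two toy sample sets are assumptions made for the example.

```python
def excess_kurtosis(samples):
    """Fourth standardized moment minus 3, so a Gaussian gives about zero."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2) - 3.0


def classify(samples, tol=0.1):
    """Label a sample set by the sign of its excess kurtosis."""
    k = excess_kurtosis(samples)
    if k < -tol:
        return "subgaussian"
    if k > tol:
        return "supergaussian"
    return "approximately gaussian"


# Evenly spread states along the variation range (uniform-like mode).
uniform_like = [i / 100.0 for i in range(-100, 101)]
# A mode that mostly rests at zero and seldom deforms (sparse mode).
sparse = [0.0] * 98 + [10.0, -10.0]
```

The evenly spread sample has an excess kurtosis close to -1.2, the value of the uniform density, while the sparse sample is strongly supergaussian.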

In the shape domain, the frequency of a certain position for a particular
shape will affect the sub- or supergaussianity of the modes of variation. A mode
of variation of a shape which has a preferred position and seldom deforms will
have a sparse or supergaussian distribution. On the other hand, a mode of
variation with almost equiprobable states along its variation range is clearly
subgaussian. When preparing a training set for the generation of a PDM there is
a tendency towards generating uniform distributions of the modes of variation.
This also favours subgaussian distributions. Symmetrical and gradual
deformations of a part of the shape correspond to continuous symmetrical
distributions. This is a frequent situation. In the shape problem, skewness is
more related to the sampling than to the real deformations of a shape. When a
shape can be found in different states but does not deform continuously from one
state to the other, we find clusters in the distribution of the mode of
variation. Clusters are frequently introduced by the incorrect identification of
landmark points. We conclude that our density model should be open to both sub-
and supergaussian distributions, but particularly the former. Symmetry is very
frequent, and the model should also include multimodal densities, unless we have
some other prior knowledge.

The broadest approach is simply to use a histogram approximation.
The problem is that its discrete nature introduces complexity in the equations.
Closely related is the use of a kernel method with Radial Basis Functions. This
kernel method positions Gaussians at all the samples in the distribution. In our
case, if we have K samples $s_1, \ldots, s_K$, the corresponding density can be
expressed as

$$p_{ker}(s) = \sum_{i=1}^{K} \frac{1}{K} G(s; s_i, \sigma) \qquad (7)$$



where a good choice for $\sigma$, if $K$ is sufficiently high, is $\left(\frac{4}{3K}\right)^{1/5}$.
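
A kernel estimate of this kind can be sketched as below. The width is chosen here with Silverman's rule of thumb, which is an assumption of this example; for unit-variance components it reduces to (4/(3K))^(1/5). Function names and sample values are hypothetical.

```python
import math


def gaussian(s, center, sigma):
    """One-dimensional Gaussian G(s; center, sigma)."""
    z = (s - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def silverman_bandwidth(samples):
    """Rule-of-thumb width (assumed here): std * (4 / (3K)) ** (1/5)."""
    k = len(samples)
    mean = sum(samples) / k
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / k)
    return std * (4.0 / (3.0 * k)) ** 0.2


def kernel_density(s, samples, sigma):
    """Equation (7): average of Gaussians placed at every sample."""
    return sum(gaussian(s, si, sigma) for si in samples) / len(samples)


# Hypothetical samples of a bimodal independent mode of variation.
samples = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
sigma = silverman_bandwidth(samples)
density_at_mode = kernel_density(2.0, samples, sigma)
density_at_gap = kernel_density(0.0, samples, sigma)
```

Even with this small sample the estimate is higher at the cluster than in the gap between clusters, and it integrates to one by construction.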

Even though the kernel method can satisfactorily model the situations
presented, it is non-parametric. In the development of algorithms which make use
of PDMs, parametric methods can be more useful since they are faster and provide
analytic solutions to several problems such as the inverse likelihood problem. A
possible simplification is to use mixture models such as GMMs or other more
specific models. For instance, in a problem in which sparsity is known in advance,
Laplacian (or double-exponential) Mixture Models can be used on the marginal
densities.



$$p_{lmm}(s) = \sum_{l=1}^{L} \frac{w_l}{2\sigma_l} \exp\left(-\frac{|s - \mu_l|}{\sigma_l}\right) \qquad (8)$$



where $w_l$ are the weights, $\mu_l$ the means and $\sigma_l$ measures of variance. As
mentioned, the parameters in this case can be estimated through an EM algorithm.
These semi-parametric methods are good for modeling both unimodal and
multimodal densities but do not succeed when we wish to model highly subgaussian
densities such as a uniform density. Experiments were performed using GMMs
when possible, and RBF kernel-based methods otherwise.
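
As a sketch of how a Laplacian mixture is evaluated, the function below uses the standard double-exponential density with scale parameters. The exact parameterization of the original model is an assumption here, and the weights, means and scales are hypothetical values rather than EM estimates.

```python
import math


def laplacian_mixture_pdf(s, weights, means, scales):
    """Weighted sum of double-exponential densities with the given scales."""
    return sum(w / (2.0 * b) * math.exp(-abs(s - mu) / b)
               for w, mu, b in zip(weights, means, scales))


# Hypothetical two-component mixture; the weights must sum to one.
weights = [0.3, 0.7]
means = [-1.0, 2.0]
scales = [0.5, 1.0]
peak = laplacian_mixture_pdf(2.0, weights, means, scales)
```

Because each marginal density is one-dimensional, evaluating such a mixture costs only a handful of exponentials per mode of variation.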



3.3 Shape Plausibility with ICA 

Suppose we have estimated the density for each of the independent modes of
variation with one of the methods suggested above so that $s^m \sim p^m(s)$.
Given a certain probability value $P_t$ between 0 and 1, let $p_t = P_t^{1/M}$. For each
component there exists a union of disjoint intervals

$$I^m = [a_1^m, a_2^m] \cup [a_3^m, a_4^m] \cup \cdots \cup [a_{2t_m-1}^m, a_{2t_m}^m]$$

such that for all $m = 1, \ldots, M$, $I^m$ satisfies

$$\int_{I^m} p^m(s)\, ds = p_t$$

and

$$p^m(a_i^m) = l^m \qquad \forall i = 1, \ldots, 2t_m$$

Once we have the set of intervals for each component, using the assumption
of independence, it can be seen that

$$\int_{I^1 \times \cdots \times I^M} p(s)\, ds = \prod_{m=1}^{M} \int_{I^m} p^m(s^m)\, ds^m = \prod_{m=1}^{M} p_t = P_t \qquad (9)$$



A constructive method which shows the existence of these intervals for a
bimodal density obtained from experiments is illustrated in Fig. 1(a). We first
assume the probability density is continuous in $\mathbb{R}$. Given the likelihood
value $L$, it can be seen that if the line $y = L$ intersects the function
$y = p^m(s)$, it does so in an even number of points. These points determine the
interval borders. If the intersection is empty, we define $I^m$ as the empty set.
The method consists in starting at a likelihood above the maximum and decreasing
the likelihood value, thus increasing the probability, until the threshold is
reached.
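
The constructive method can be approximated numerically on a grid. In the sketch below, lowering the level y = L is implemented by admitting grid cells from the highest density downwards until the requested probability is enclosed; the grid resolution, the function names and the bimodal test density are assumptions of the example.

```python
import math


def feasibility_intervals(density, lo, hi, mass, n_cells=2000):
    """Highest-density intervals of a 1-D density enclosing `mass`."""
    width = (hi - lo) / n_cells
    centers = [lo + (i + 0.5) * width for i in range(n_cells)]
    values = [density(c) for c in centers]
    # Visit cells by decreasing density: same effect as lowering the level.
    order = sorted(range(n_cells), key=lambda i: values[i], reverse=True)
    inside = [False] * n_cells
    enclosed = 0.0
    level = None
    for i in order:
        if enclosed >= mass:
            break
        inside[i] = True
        enclosed += values[i] * width
        level = values[i]
    # Merge runs of adjacent admitted cells into intervals [a, b].
    intervals = []
    start = None
    for i, flag in enumerate(inside):
        if flag and start is None:
            start = i
        if not flag and start is not None:
            intervals.append((lo + start * width, lo + i * width))
            start = None
    if start is not None:
        intervals.append((lo + start * width, hi))
    return intervals, level


def bimodal(s):
    """Hypothetical density: two equal Gaussians at -2 and 2, width 0.5."""
    def g(c):
        z = (s - c) / 0.5
        return math.exp(-0.5 * z * z) / (0.5 * math.sqrt(2.0 * math.pi))
    return 0.5 * g(-2.0) + 0.5 * g(2.0)


intervals, level = feasibility_intervals(bimodal, -6.0, 6.0, 0.95)
```

For this density the procedure returns two disjoint intervals, one around each mode, and the returned level plays the role of the likelihood value at the interval borders.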

In practice, implementation depends on the density model. For certain
parametric and semi-parametric models, both the likelihood and the interval
borders can be obtained analytically. This is performed in the training stage.
Any algorithm working on new shapes will need only the interval information for
plausibility tests. Dividing each direction into intervals divides the whole
space into hyperboxes which have a geometric distribution reminiscent of that
which arises from separable functions. In Fig. 1(b) the joint distribution of two
independent directions obtained from real experiments is plotted. Each marginal
density (clearly bimodal) was estimated with a kernel-based method. The product
of the marginal densities is plotted with gray levels and contour lines. The
rectangular boxes represent the cartesian product of the intervals estimated for
each direction for $P_t = 0.95$.
Fig. 1. (a) For a certain bimodal independent mode of variation, two intervals and a
certain probability value correspond to the likelihood value $l^m$. (b) The same but in the
bivariate case. The curves represent contour lines of the density estimation (done with
a kernel method), and the rectangles represent the cartesian product of the intervals,
capturing 95% of the probability.



Since the calculation of the intervals is performed in the training stage, all
complex algebraic operations are removed from the working algorithms. This
is because there is no need for the calculation of likelihoods once we have the
interval limits. This interval structure also provides precision. It can be seen
in Fig. 1(b) that, if we decided to use a GMM, the only way to improve the
estimation would have been to use more than four components in the mixture.
This becomes a serious problem as dimensions increase and we have no prior
knowledge of the structure.

In the interval context, plausibility is easily checked by first projecting the
shape onto the parameter space and then verifying that $s^m \in I^m$ for all
$m = 1, \ldots, M$.

Given the intervals and a shape with independent modes of variation $s^m$,
the nearest feasible shape $s_F$, with components $s_F^m$, is

$$s_F^m = \begin{cases} s^m & \text{if } s^m \in I^m \\ \arg\min_{a_i^m,\ 1 \le i \le 2t_m} |s^m - a_i^m| & \text{otherwise} \end{cases} \qquad (10)$$
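
Once the intervals are stored, both operations reduce to comparisons. The following sketch assumes each mode has a non-empty list of (a, b) interval pairs; the function names and the interval values are hypothetical.

```python
def is_plausible(s, intervals_per_mode):
    """True when every component falls inside one of its intervals."""
    return all(any(a <= sm <= b for a, b in intervals)
               for sm, intervals in zip(s, intervals_per_mode))


def nearest_feasible(s, intervals_per_mode):
    """Equation (10): keep feasible components, snap the others to a border."""
    result = []
    for sm, intervals in zip(s, intervals_per_mode):
        if any(a <= sm <= b for a, b in intervals):
            result.append(sm)
        else:
            borders = [edge for interval in intervals for edge in interval]
            result.append(min(borders, key=lambda edge: abs(sm - edge)))
    return result


# Hypothetical model: a bimodal first mode and a unimodal second mode.
intervals_per_mode = [
    [(-3.0, -1.0), (1.0, 3.0)],
    [(-2.0, 2.0)],
]
s = [0.4, 5.0]
s_feasible = nearest_feasible(s, intervals_per_mode)
```

No likelihood is evaluated at this stage: the first component is moved to the closest border of the nearer cluster and the second one is capped at the end of its single interval.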



4 Experiments 

An artificial set of shapes was created. In each shape we use 19 points to describe
a fixed base and three deformable extensions of fixed length (see Fig. 2). Each
extension can be found rotated by an angle within a bounded range. We created a
training set of 400 shapes, choosing randomly the angle corresponding to each
extension. We have, then, three independent degrees of freedom for each shape.
Since the shapes were already aligned when created, only centering was necessary.
Fig. 2(a) shows the three principal modes of variation when given values between
the mean and approximately three standard deviations. PCA decorrelates each
of the movements but does not take into account statistics of higher order. The
decorrelated movements have no relationship with the degrees of freedom chosen
in the creation of the shapes. Fig. 2(b) shows the three independent modes
of variation. We observe how ICA separated the deformations corresponding
to each one of the extensions. This allows unidimensional density estimation
for the statistical model, where each dimension clearly represents a degree of
freedom. It also gives a straightforward and easily interpretable solution for the
problem of classifying a shape according to its deformations. Experiments were




Fig. 2. (a) The three modes of variation using PCA for an artificial set of shapes. (b)
The three independent modes of variation using ICA, for the same artificial set.



also performed on a dataset of shapes representing 180 hands with the five fingers
extended. The limited number of samples has a strong influence on the estimation
of the ICA model, so the dimension of the Point Distribution Model should be taken
into account. We finally decided to describe the hands by 11 points each, so the
resulting dimension is 22. This results in a naive shape descriptor for a hand,
but a more complete set of landmark points would result in a higher dimension
and the ICA Model would no longer be reliable. This, of course, can be solved
by increasing the number of samples. The PCA space of parameters captures
95% of the data variation with a dimension of 5. In Fig. 3(a) we observe the first
five principal modes of variation. The first two modes capture practically all the
movement in the hands, mixing the movement of all fingers (except the index
in the first mode). The remaining three components capture the uncorrelated
variation of groups of one or two fingers. In all, except maybe for the second
mode, the variations presented here do not resemble any kind of realistic hand
movement. The first five independent modes of variation are shown in Fig. 3(b).
These five modes were obtained by performing ICA on the principal modes of
variation of dimension five. It has been observed that this can bring up
corrupted independent components. Even though this seems to be the case, we
observe some interesting differences with the principal modes of variation. First
of all, ICA isolated the thumb, which clearly is the only finger we
can displace independently in our hand. The rest of the movements resemble
much more realistic "independent" hand movements. For instance, the opening
of the hand moves the thumb, ring and little finger much more than the index
and middle fingers. This is illustrated in the second mode. The fourth mode
Fig. 3. (a) The five principal modes of variation for the set of hand shapes. (b) The five
independent modes of variation, for the same set.



shows the close dependence between index and thumb. The third and fifth modes
correspond to the second and third principal modes, respectively.

Even though this analysis of the modes gives us some idea of what the ICA
representation can achieve, the most important advantage of ICA is the way it
simplifies the density estimation. Fig. 4(a) shows the distribution of the first two
principal modes and what results from deciding plausibility assuming a bivariate
normal distribution or hard limits (ellipse or rectangle). It can be seen that
neither of these assumptions holds. Fig. 4(b) shows the corresponding independent
modes, where sources are clearly more separated and the interval method after
a kernel density estimation was employed. In all cases, the limits were chosen so
that they encompass 98% of the probability.




Fig. 4. The first two modes of variation for the PCA and the ICA representation,
respectively. The limits shown are the limits for plausible shapes: in the PCA case assuming
independence (square) and normality (ellipse); in the ICA case assuming independence.
The higher precision of the latter is observed (for testing shape feasibility, for instance).



5 Conclusions 

In this paper, we present ICA as an alternative representation for data
corresponding to n-dimensional shapes. This representation, based on higher order
statistics, has important advantages for PDMs. The assumption of independence
makes density estimation a one-dimensional problem and thus permits the
application of complex and accurate methods. We propose non-parametric and
semi-parametric density estimation methods for each of the independent modes
of variation. The density estimation is used when considering the plausibility
of new shapes or when approximating a certain shape with the nearest feasible
shape. A simple method based on feasibility intervals in every direction is
presented. It addresses both of these problems and is robust to multimodality.
These intervals are obtained in the training stage and all subsequent operations
for testing plausibility are reduced to simple comparison operations. When similar
methods are used for PCA representations, practical results can be unsatisfactory
since they assume independence and only ensure decorrelation (see Fig. 4).

Moreover, in an ICA framework, modes of variation no longer represent
only the deviations within the shape and can now be thought of as
independent deformations. This allows greater control when modeling, due to the
natural association between independent deformations and each independent
mode of variation. In the experiments, the independent component that
captures the movement of the thumb illustrates this idea. This could be applied
to shape understanding and classification. The latter can be done by learning a
separate ICA model for each of the classes to be considered. The independence
assumption allows the fast implementation of a Bayesian decision scheme.

Still, validation on an extended dataset of real shapes showing independent 
deformations is necessary. Testing on Active Shape Models would test the perfor- 
mance of the algorithms and the shape plausibility considerations. Classification 
with the ICA Model and comparisons with previous approaches can be of great 
use in tagging and tracking of shapes. 

From the results we conclude that the ICA representation can be successfully 
used when there are reasons to think that different shape deformations corre- 
spond to independent factors, and the shapes we observe are linear mixtures of 
these deformations. In this case, ICA can not only separate the deformations 
allowing control and classification, but can also provide a robust and simple 
density estimation framework. 
Abstract. In this paper we propose two techniques to qualitatively estimate
distance in monocular vision. Two kinds of approaches are described, the
former based on texture analysis and the latter on histogram inspection. Although
both methods only allow us to determine whether a point within an image is
nearer or farther than another with respect to the observer, they can be usefully
exploited in all those cases where precision is not critical or single images are
the only source of information available. Moreover, combined with previously
studied techniques, they could be used to provide more accurate results.
Step-by-step algorithms will be presented, along with examples illustrating their
application to real images.



1 Introduction 

Monocular vision concerns the analysis of data obtainable from single images. While 
in stereo or trinocular vision spatial information can be drawn by comparing different 
images of the same scene, in monocular vision the analysis can be performed only by 
studying intrinsic characteristics of the representation. For example, comparisons and 
statistical investigations can be carried out on the distribution of gray levels (in the 
monochromatic case) or of the red, green and blue channels (in the RGB case) of the 
pixels composing the image. Therefore, results which can be obtained are influenced 
by the acquisition systems employed and by the spatial and tonality resolutions 
adopted. 

The two techniques we propose in this paper are based on texture and histogram
analysis. Although only qualitative information can be obtained through them, we
think they can nevertheless be useful when the precision of the results is not critical or
when there is no other source of data available. Monocular vision may be notably
more advantageous than techniques exploiting pairs or sequences of images, since
it requires simpler acquisition systems and is computationally more efficient in terms
of execution times. Moreover, qualitative estimations could be used to confirm
evaluations obtained by means of other more precise techniques (such as binocular
vision, infrared sensors, etc.).

The paper is structured as follows. Section 2 will describe the approach based on
texture analysis. After a brief discussion about previous works regarding the topic and
an introduction to the theory behind it, two practical algorithms will be presented, of
which the second can be used to distinguish between nearer and farther zones within
images in perspective projection. Section 3 will describe the technique based on
histogram analysis. A brief introduction will precede the presentation of the
implemented algorithm, which, like the texture-based one, allows relative distances within
images in perspective projection to be estimated. Section 4, lastly, will draw some
conclusions and suggest directions for future research. All the algorithms are followed
by examples illustrating their use.



2 Texture Analysis as a Source of Spatial Information 

In general, a surface can be considered as being characterized by some form of texture 
if the relevant shapes composing it are uniformly distributed, i.e. if they do not differ 
very much in appearance, size and density throughout the extension of the surface 
itself [1]. 

Essentially, two different approaches have been proposed for texture analysis.
Some researchers (such as [2]) follow a structural approach, which requires the real
structure of texture to be determined (periodicity, uniformity, symmetry, etc.).
Although this is probably the method used by the human vision system to infer the 3D
structure of the environment, it is difficult to automate. Other more feasible
techniques limit themselves to making assumptions about the texture arrangement. For
instance, if the texture is isotropically distributed, i.e. line segments composing the
real surface have no prevalent direction, three-dimensional information can be
obtained from the "preferred" direction observed. This approach was first proposed
by Witkin [3] and subsequently improved by Davis et al. [4]. Gårding [5] and Blake et
al. [6] perfected the method under the hypothesis of orthographic projection. As
regards perspective projection, Kanatani [7] provided a rigorous mathematical
description of the problem. Under the hypothesis that texture is composed of points and
straight lines only, he developed a technique based on texture homogeneity and
density. However, this algorithm is not able to produce good results when applied to real
scenes. In fact, Kanatani analyzes how the density of image points varies as distance
increases, asserting that for greater distances density along a surface grows.
Unfortunately, in most real cases the number of relevant points decreases, because of lens
blur.

We will now concentrate on Witkin's algorithm, in the form corrected by Davis et
al. [4]. In Section 2.2 it will be used to determine, given two points on an image in
perspective projection, which one corresponds to a farther region in the real scene.



2.1 Texture Analysis in Orthographic Projection 

When observing shapes on a plane surface, two different kinds of geometric
distortions can be noted: (1) as the surface recedes from the observer, shapes appear to be
smaller and smaller; (2) the more a surface is inclined with respect to the image plane,
the more shapes on the image appear to be flattened towards the tilt direction
(foreshortening effect). As will be shown, such distortions can be usefully exploited to get
spatial information about the real scene.

In image processing, when the orthographic projection hypothesis is satisfied,
which means that every point in the three-dimensional space is orthogonally projected
on the image plane (see Figure 1), effect (1) can be ignored and it becomes easier to
estimate the orientation of the plane on which the real scene lies.
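
The difference between the two projection models can be sketched as follows. The pinhole convention used for the perspective case (image plane at distance `focal_length` along the Z axis) is an assumption of this example and may differ from the reference system of Figure 1.

```python
def orthographic(point):
    """Drop the depth coordinate: (X, Y, Z) maps to (X, Y)."""
    x, y, _ = point
    return (x, y)


def perspective(point, focal_length=1.0):
    """Pinhole model (assumed convention): scale by focal_length / Z."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# The same spatial offset observed at two different depths.
near = (2.0, 1.0, 4.0)
far = (2.0, 1.0, 8.0)
ortho_near, ortho_far = orthographic(near), orthographic(far)
persp_near, persp_far = perspective(near, 2.0), perspective(far, 2.0)
```

Under orthographic projection both points map to the same image position, so the shrinking with distance of effect (1) disappears, while under perspective projection the farther point lies closer to the image center.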




Fig. 1. Reference cartesian system. Orthographic (point p’) and perspective (point p) projec- 
tions of the spatial point P on the image plane 

As already stated, for orthographic projection Witkin [3] was the first to address the
problem of inferring the spatial arrangement of a plane from image texture analysis,
under the hypotheses of independence and isotropy, which can be summarized as
follows:

1. Given an image on the image plane, the possible orientations of the planes on 
which the original scene may lie are equiprobable. 

2. On the plane on which the original image lies, the possible orientations of the tan- 
gents to the curves composing the image itself are equiprobable. 

3. Orientations of the elements in the real image are statistically independent. 

Witkin's studies led to a practical algorithm by which it is possible to estimate the 
$\sigma$ and $\tau$ polar coordinates (slant and tilt) of the plane on which a scene lies (Figure 1). 
Let us suppose we have a plane S, on which there are curves and shapes satisfying the 
isotropy and independence hypotheses. Also, suppose the orthographic projection 
hypothesis holds for the projection of S on the image plane I. Then, the directions $\beta$ of 
the tangents to the curves on S will have $\mathrm{p.d.f.}(\beta) = 1/\pi$ as probability density function, 
where $\beta \in [0, \pi[$. As regards parameters $\sigma$ and $\tau$, the p.d.f. can be expressed as 
$\mathrm{p.d.f.}(\sigma,\tau) = (\sin\sigma)/\pi$. In fact, consider the Gaussian sphere, which is the unit radius 
sphere formed by all the possible normals to all the possible plane arrangements in space. 
For each value of $\sigma$, the possible values of $\tau$ define a circumference on the Gaussian 
sphere, whose radius approaches 1 as $\sigma$ approaches $\pi/2$. The probability that the 
orientation of a normal corresponds to a certain point on a certain circumference is the 
same for every point on it and is proportional to the length of the circumference itself 
($2\pi\sin\sigma$). Therefore, the p.d.f. can be simply expressed as a function of $\sin\sigma$, without 
the need to introduce parameter $\tau$. Since, in general, it is not possible to measure the 
$\beta$ angles directly, it is necessary to find their transformation into the corresponding 
angles on the image plane I (let us call them $\alpha$). If the orientation of plane S is $(\sigma,\tau)$, 
the image on I can be translated into the one on S by rotating it by the same angles. Then, 
supposing $\tau = 0$, point $p(x, y, 0)$ on I will have $X = x\cos\sigma$, $Y = y$ and $Z = x\sin\sigma$ as 
coordinates on S. Since the orthographic projection of a spatial point $(X, Y, Z)$ on I is 
$(X, Y)$, a simple way to get the projection on plane I of a curve which lies on plane S 
consists in "placing" the curve on I, rotating it by $(\sigma,\tau)$, and projecting it again on I. 
Now, let us consider the versor $t = [\cos\beta, \sin\beta]$, tangent to a curve on S at a certain 
point, according to an angle $\beta$. After the rotation on I, it will become 
$t' = [\cos\beta\cos\sigma, \sin\beta]$. Therefore, we have that 
$\tan\alpha = \sin\beta/(\cos\beta\cos\sigma) = \tan\beta/\cos\sigma$, where $\alpha$ is the orientation of versor $t'$ on 
plane I. If $\tau \neq 0$, it is sufficient to add it to $\alpha$, which means that 
$\alpha = \arctan(\tan\beta/\cos\sigma) + \tau$ and $\beta = \arctan(\cos\sigma\cdot\tan(\alpha-\tau))$. 

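Considering that, in general, $\mathrm{p.d.f.}(\varphi(x)) = \mathrm{p.d.f.}(x)\cdot dx/d\varphi(x)$, we find: 

$$\mathrm{p.d.f.}(\alpha(\beta) \mid \sigma,\tau) = \mathrm{p.d.f.}(\beta)\,\frac{d\beta}{d\alpha} = \frac{1}{\pi}\,\frac{\cos\sigma}{\cos^2(\alpha-\tau)+\cos^2\sigma\,\sin^2(\alpha-\tau)}$$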

However, an image will be composed of many curves and hence it is also necessary 
to find the compound probability density function $\mathrm{p.d.f.}(A \mid \sigma,\tau)$, where A is a 
set $\{\alpha_i\}$ containing the various $\alpha$ angles measured on the image plane for all the 
curves on it. If all $\alpha_i$ are independent, the following can be obtained: 

$$\mathrm{p.d.f.}(A = \{\alpha_i\} \mid \sigma,\tau) = \prod_{i=1}^{n}\mathrm{p.d.f.}(\alpha_i \mid \sigma,\tau) = \pi^{-n}\prod_{i=1}^{n}\frac{\cos\sigma}{\cos^2(\alpha_i-\tau)+\cos^2\sigma\,\sin^2(\alpha_i-\tau)} \qquad (1)$$



where n is the number of $\alpha$ angles. In substance, the preceding expression allows 
us to calculate how probable a set of values $\{\alpha_i\}$ is, given $\sigma$ and $\tau$. However, we 
are interested in the opposite problem, i.e. in finding the probability of a particular 
couple $(\sigma,\tau)$ given a set of values $\{\alpha_i\}$. Applying Bayes' rule and normalizing 
so that the integral of the probability density function is equal to one, we have: 

$$\mathrm{p.d.f.}(\sigma,\tau \mid A) = \frac{\mathrm{p.d.f.}(\sigma,\tau)\,\mathrm{p.d.f.}(A \mid \sigma,\tau)}{\iint \mathrm{p.d.f.}(\sigma,\tau)\,\mathrm{p.d.f.}(A \mid \sigma,\tau)\,d\sigma\,d\tau} \qquad (2)$$

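The couple $(\sigma,\tau)$ which maximizes $\mathrm{p.d.f.}(\sigma,\tau \mid A)$ is the most probable one and 
therefore it can be assumed as the orientation of plane S (maximum likelihood estimate 
technique). From expression (1), and remembering that $\mathrm{p.d.f.}(\sigma,\tau) = \sin\sigma/\pi$, the 
value of the numerator of expression (2) can be calculated as follows: 

$$\mathrm{p.d.f.}(\sigma,\tau)\,\mathrm{p.d.f.}(A \mid \sigma,\tau) = \frac{\sin\sigma}{\pi}\prod_{i=1}^{n}\frac{1}{\pi}\,\frac{\cos\sigma}{\cos^2(\alpha_i-\tau)+\cos^2\sigma\,\sin^2(\alpha_i-\tau)}$$

$$= \exp\left[\log\frac{\sin\sigma}{\pi} + \sum_{i=1}^{n} a_i \log\frac{\cos\sigma}{\pi\left(\cos^2(\alpha_i-\tau)+\cos^2\sigma\,\sin^2(\alpha_i-\tau)\right)}\right] \qquad (3)$$

where $a_i$ is the number of measures whose value is $\alpha_i$, so that the sum runs over the 
distinct values $\alpha_i$. 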

The previous considerations lead to the following algorithm to estimate slant and 
tilt of the plane on which a scene lies. 

Algorithm 1. The practical algorithm we have implemented is derived from that of 
Witkin and is composed of the following steps: 

1. If the original image on the image plane is known to be affected by noise, it is 
filtered through a low-pass filter. 

2. After having been normalized, the image undergoes an edge-detection process, 
through the Sobel operator. As a result, two different “images” are obtained from 
the gradient, namely the module image and the phase image. 

3. The module image is thresholded against a certain value, to get a binary image 
(mask) identifying only the most relevant edges. We are currently studying how to 
choose this value automatically. 

4. For each pixel in the original image whose value in the mask is one, the direction 
of its tangent is calculated. This is done by simply rotating by 90 degrees the corre- 
sponding value in the phase image. 

5. Interval $[0, \pi[$ is subdivided into n sub-intervals. An array $A = (a_1, \ldots, a_n)$ is built in 
which $a_i$ is the number of values obtained in step 4 falling into the i-th sub-interval 
($\alpha_i$). Interval $[0, \pi/2]$ ($\sigma$ values) is subdivided into m sub-intervals and interval 
$[0, \pi[$ ($\tau$ values) is subdivided into p sub-intervals. 

6. For each possible couple $(\sigma_j, \tau_k)$ ($j \in [0, m-1]$, $k \in [0, p-1]$), expression (3) is 
calculated. The couple $(\sigma_j, \tau_k)$ for which the result is maximum is taken as the 
orientation of the plane containing the real scene in the three-dimensional space. 
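
Steps 5 and 6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes the tangent directions from step 4 are already available as an array of angles in radians, it evaluates the logarithm of expression (3) rather than expression (3) itself (the maximum is the same), and the function name `estimate_slant_tilt` is hypothetical. The bin centres follow the parameter definitions given in the experimental results.

```python
import numpy as np

def estimate_slant_tilt(alphas, n=64, m=90, p=180):
    """Grid search for the (slant, tilt) pair maximizing the log of expression (3).

    alphas: tangent directions measured on the image plane, in radians.
    Returns (sigma, tau) in radians.
    """
    alphas = np.mod(np.asarray(alphas, dtype=float), np.pi)
    # Step 5: a_i = number of directions falling into the i-th sub-interval of [0, pi).
    counts, _ = np.histogram(alphas, bins=n, range=(0.0, np.pi))
    alpha_c = np.pi * (0.5 + np.arange(n)) / n        # alpha_i (bin centres)
    sigma = np.pi * (0.5 + np.arange(m)) / (2 * m)    # sigma_j in (0, pi/2)
    tau = np.pi * (0.5 + np.arange(p)) / p            # tau_k in (0, pi)
    # Step 6: evaluate the exponent of expression (3) on the whole (m x p) grid.
    s = sigma[:, None, None]
    d = alpha_c[None, None, :] - tau[None, :, None]
    denom = np.cos(d) ** 2 + np.cos(s) ** 2 * np.sin(d) ** 2
    log_lik = np.log(np.cos(s) / (np.pi * denom))     # shape (m, p, n)
    log_post = np.log(np.sin(sigma) / np.pi)[:, None] + log_lik @ counts
    j, k = np.unravel_index(np.argmax(log_post), log_post.shape)
    return sigma[j], tau[k]
```

Working with the logarithm avoids the numerical underflow that the raw product in expression (3) would cause for images with many edge pixels.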

Experimental Results. We tested algorithm 1 with many images and the results 
obtained confirm its effectiveness in estimating slant and tilt of real scenes. Figure 2 
shows three example images, for which the computed $\sigma$ and $\tau$ are reported. Of course, 
the algorithm considers them as if they were in orthographic projection. To give an 
intuitive insight into the results obtained, circles oriented according to the calculated 
values are superimposed on the original images, and, at the center of the circles, 
segments normal to the estimated planes are placed. For the algorithm, the following 
parameter values have been assumed: n = 64, m = 90, p = 180, $\alpha_i = \pi(1/2+i)/n$, 
$\sigma_j = \pi(1/2+j)/(2m)$, $\tau_k = \pi(1/2+k)/p$, where $i \in [0, n-1]$, $j \in [0, m-1]$ and $k \in [0, p-1]$. 




σ = 56.5°, τ = 90.5°        σ = 69.5°, τ = 147.5°        σ = 58.5°, τ = 87.5° 

Fig. 2. Three examples of application of Algorithm 1. Beneath each image, which is considered 
as if it was in orthographic projection, the calculated σ and τ parameters are reported 



2.2 Texture Analysis in Perspective Projection 

In perspective projection, each point in the three-dimensional space is projected on 
the image plane through the focus of the optical system (see again Figure 1). When, as 
often occurs, the orthographic projection hypothesis is not satisfied, the algorithm 
presented in the previous section can only produce approximate results. In fact, 
the decrease in shape size and the foreshortening effect cause the image on the image 
plane to be less “uniform”, thus weakening the concept of texture itself. However, 
intuitively, if sufficiently small regions are extracted from an image in perspective 
projection, it is reasonable to consider them as if they were in orthographic projection. 
If algorithm 1 is then applied to each of them, we will find that those regions 
which are farther with respect to the tilt direction produce greater values for parameter 
$\sigma$, because shapes turn out to be more flattened. In fact, referring to Figure 1, given a 
point P(X,Y,Z) on plane S, the corresponding coordinates on the image plane will be 
$x = fX/(f+Z)$ and $y = fY/(f+Z)$. Let us suppose $\tau = 90°$ and consider the perspective 
projection on the image plane of a circle placed on S. The resulting image, unless 
$\sigma = 90°$, will be an ellipse, as shown in Figure 3. 




If the distance of point $y_1$ from the image plane is Z and the circle's diameter is D, 
the ellipse's width $\Delta x$ can be calculated as $fD/[f+Z+(D/2)\sin\sigma]$. If $y_1$ has Y as the 
corresponding coordinate in space, the ellipse's height $\Delta y$ is $\Delta y = y_2 - y_1$, where 
$y_1 = fY/(f+Z)$ and $y_2 = f(Y+D\cos\sigma)/(f+Z+D\sin\sigma)$. Since the equation of plane S is 
$Z = \tan\sigma\cdot Y + \delta$ (where $\delta$ is the coordinate of the intersection of plane S with the Z 
axis), we obtain: $\Delta y = y_2 - y_1 = fD\cos\sigma\,(f+\delta)/[(f+Z)(f+Z+D\sin\sigma)]$. If the projection 
was orthographic instead of perspective, relation $\cos\sigma_o = \Delta y/\Delta x$ would hold (where 
$\sigma_o$ indicates the slant calculated in the orthographic case). In fact, $\Delta x$ would be equal 
to D and $\Delta y$ would be the projection of D on the image plane, i.e. $\Delta y = D\cos\sigma_o$. 
However, if a sufficiently small area is selected on the image to be processed, 
algorithm 1 can provide a good approximation of the image slant. Then, supposing this is 
the case, from the previous expressions we have: 

$$\cos\sigma_p = \frac{\Delta y}{\Delta x} = \frac{\cos\sigma\,(f+\delta)\left[f+Z+\frac{D}{2}\sin\sigma\right]}{(f+Z)(f+Z+D\sin\sigma)}$$

where $\sigma_p$ indicates the slant calculated in the perspective projection case. It is evident 
that when Z increases (i.e. the distance from the observer grows) $\sigma_p$ must also 
increase. 
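
The dependence of the apparent slant on distance can be checked numerically. The sketch below simply evaluates the ratio Δy/Δx built from the width and height expressions given above; the function name `perspective_slant` and the default values chosen for the focal length f, the diameter D and the offset δ are illustrative assumptions, not values taken from the paper.

```python
import math

def perspective_slant(sigma, Z, f=1.0, D=0.1, delta=0.0):
    """Apparent slant (radians) of a small circle of diameter D seen in perspective.

    sigma: true slant of plane S (radians); Z: distance from the image plane;
    f: focal length; delta: intersection of plane S with the Z axis.
    """
    num = math.cos(sigma) * (f + delta) * (f + Z + 0.5 * D * math.sin(sigma))
    den = (f + Z) * (f + Z + D * math.sin(sigma))
    # Clamp to the valid domain of acos to guard against rounding errors.
    return math.acos(max(-1.0, min(1.0, num / den)))

if __name__ == "__main__":
    s = math.radians(60.0)
    for Z in (1.0, 2.0, 3.0, 4.0):
        print(Z, math.degrees(perspective_slant(s, Z, delta=1.0)))
```

With the plane fixed (σ and δ constant), the printed slant grows with Z, which is the property Algorithm 2 relies on.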

The previous considerations suggest a simple but effective way to take into account 
perspective when trying to estimate distance in images for which the orthographic 
projection hypothesis is not valid. 

Algorithm 2. Given an image on the image plane, to determine whether a certain point 
is farther or nearer than another (with respect to the observer), it is sufficient to 
consider two areas around them and apply algorithm 1. The farther point is that for which 
$\sigma$ is greater. 



Experimental Results. Figure 4 shows some examples of application of algorithm 2. 
Rectangles on the images identify the regions analyzed through algorithm 1, whose 
estimated $\sigma$ and $\tau$ are reported. Values assumed for the parameters are the same as for 
the examples in Section 2.1 (n = 64, m = 90, p = 180). 




Estimated parameters for the two regions of each image, in the order in which they appear: 

Image 1: (σ = 61.5°, τ = 88.5°) and (σ = 51.5°, τ = 96.5°) 
Image 2: (σ = 73.5°, τ = 149.5°) and (σ = 71.5°, τ = 144.5°) 
Image 3: (σ = 43.5°, τ = 158.5°) and (σ = 79.5°, τ = 177.5°) 
Image 4: (σ = 71.5°, τ = 88.5°) and (σ = 60.5°, τ = 90.5°) 

Fig. 4. Examples of application of Algorithm 2. At the right of each image, in correspondence 
with rectangles identifying the examined regions, the estimated σ and τ parameters are reported 



3 Histogram Analysis for Distance Estimation 

In this section, we will present another algorithm for qualitative estimation of distance 
in monocular vision: histogram analysis. Instead of using geometric parameters, it 
exploits physical properties of the medium in which the objects in the images to be 
analyzed are contained. The technique, which is primarily intended to be used in 
combination with other methods, is still under development and we are 
working to automate the choice of the values for some key parameters. 

It is easy to notice that, for instance in images representing landscapes, edges of 
distant elements (such as mountains) are not very clear: colors of different objects 
tend to blend into each other and generate zones with nearly uniform colors. Such an 
effect is more marked if haze or fog is present, or when underwater images are examined. 
In general, it is more perceivable in images with very long depth of field compared 
with the medium opacity. 

Rays of light in a turbid medium are diffused by its particles. For example, air 
contains a great number of water particles, which scatter light. In a monochromatic 
image, the background's gray levels tend to be the weighted average of all the gray 
levels present in the image itself. As a result, the image becomes more and more uniform 
as the distance from the observer increases, thus producing a sort of blur. This 
blur, however, is not to be mistaken for that produced by camera lenses, which must be 
avoided altogether to get valid results from the technique we are now going to 
describe. 



3.1 The Algorithm 

As already stated, the zones of an image which are perceived as more blurred (i.e. 
which are farther with respect to the observer) have gray levels gathered around an 
average value. On the contrary, in those areas where edges are sharper (i.e. which are 
nearer with respect to the observer) gray levels are more scattered. Information about 
the distance of a zone with respect to another can thus be obtained by analyzing the 
gray level spectrum in these regions. 

The algorithm we present here is based on variance applied to the image histogram. 

Algorithm 3. Consider two points $P_1$ and $P_2$ on an image. To determine which one is 
nearer (or farther) with respect to the observer, we can proceed according to the 
following steps: 

1. Two areas $A_1$ and $A_2$ around $P_1$ and $P_2$ are extracted from the image and their 
histograms are obtained. 

2. Both for $A_1$ and $A_2$, the average gray levels $\bar{x}_1$ and $\bar{x}_2$ of their pixels are calculated. 
In general, indicating with $f(x)$ the number of pixels whose gray level is equal to x, 
the average is obtained as follows: 

$$\bar{x} = \frac{\sum_x x\,f(x)}{\sum_x f(x)}$$

3. Both for $A_1$ and $A_2$, variances $\sigma_1^2$ and $\sigma_2^2$ are calculated in the following way: 

$$\sigma^2 = \frac{\sum_x (x-\bar{x})^2 f(x)}{\sum_x f(x)}$$

4. The point which is farther with respect to the observer will be that whose variance 
is lower. 
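
A minimal Python sketch of Algorithm 3 follows. It assumes that the two areas are given as integer arrays of 8-bit gray levels and uses the standard histogram-weighted mean and variance; the function names `histogram_variance` and `farther_region` are hypothetical, and the choice of the areas themselves is left to the caller, as in the text.

```python
import numpy as np

def histogram_variance(region, levels=256):
    """Variance of the gray levels of a region, computed from its histogram."""
    # f[x] = number of pixels whose gray level equals x (steps 1 and 2).
    f = np.bincount(np.asarray(region, dtype=np.int64).ravel(), minlength=levels)
    f = f.astype(float)
    x = np.arange(f.size)
    total = f.sum()
    mean = (x * f).sum() / total
    # Step 3: histogram-weighted variance around the mean gray level.
    return ((x - mean) ** 2 * f).sum() / total

def farther_region(region_1, region_2):
    """Step 4: return 1 or 2, the index of the region judged farther (lower variance)."""
    return 1 if histogram_variance(region_1) < histogram_variance(region_2) else 2
```

Because only one pass over each area is needed, the cost is linear in the number of pixels, which is consistent with the computational lightness claimed for the method.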



Experimental results. As an example of application of algorithm 3 to a real image, 
Figure 6 displays the histograms relative to the three areas highlighted by rectangles 
(a, b and c) in Figure 5. 




Fig. 5. An example of application of algorithm 3. White rectangles are used to highlight the 
areas analyzed (a, b and c) 




(a) σ = 13.99        (b) σ = 7.69        (c) σ = 5.75 

Fig. 6. Histograms relative to the areas highlighted in Figure 5, and corresponding variances 

As can be noted, the histogram of the nearest zone (rectangle a) is more scattered 
than that of the farthest zone (rectangle c) and, accordingly, the variance of the first 
area (13.99) is greater than that of the second one (5.75). 

Of course, dimensions of rectangles may remarkably affect the results. We are cur- 
rently trying to make such a choice automatic, based on a preliminary segmentation 
phase to select sufficiently homogeneous areas. 



4 Conclusions 

In this paper we have presented two techniques which allow distances to be estimated 
from a relative point of view. That is, they make it possible to determine whether a 
point within an image is farther or nearer than another with respect to the observer. 

Two algorithms, applicable to images in perspective projection, have been de- 
scribed. The first (Algorithm 2), based on texture analysis, has been tested with a 
great number of real images in which some form of texture was recognizable, always 
giving very good results. The second (Algorithm 3), which is based on histogram 
analysis and is still to be perfected, has shown its good applicability especially to 
representations with long depth of field (e.g. landscapes). Although some precautions 
are to be taken (there must not be any form of lens blur), the use of this method is 
favored by its computational lightness. 

We hold that, although they cannot be exploited for actual distance estimations, the 
proposed algorithms, integrated with others previously studied [8], can be usefully 
employed for qualitative evaluations. Moreover, they can be exploited as a preprocessing 
step able to detect regions of interest for the subsequent application of more expensive 
techniques, to provide more accurate results. Current work is devoted to this aspect and to 
the integration of the system with different sources of information. 



Abstract. A common model of second degree variation is an ellipsoid spanned 
by the magnitudes of the Hessian eigenvalues. We find this model incomplete 
and often misleading. Here, we present a more complete representation of the 
information embedded in second degree derivatives. Using spherical harmonics 
as a basis set, the rotation invariant part of this information is portrayed as an 
orthonormal shape-space, which is non-redundant in the sense that any local 
second order variation can be rotated to match one and only one unique 
prototype in this space. A host of truly rotation invariant and shape 
discriminative shape factors is readily defined. 



1 Introduction 

The interest in three-dimensional volume analysis is steadily increasing with the 
advent of more powerful imaging technologies such as helical cone beam Computed 
Tomography and more advanced modalities of Magnetic Resonance Imaging. As an 
initial step in such analysis, some authors have been measuring the local second order 
density variation at each grid point. If the dependency on magnitude and rotation 
(orientation) is eliminated from this variation, the remaining information is shape. The 
subject of this paper is how to describe the totality of this shape information, or, in 
other words, to define the second order shape space for 3D-volumes. It seems 
customary to assume that all of these possible shapes can be modeled as ellipsoids, 
something we find to be mathematically incorrect as well as a severe limitation in 
practical applications. 

The local second order variation is measured by convolving the 3D-function 
$f(x,y,z)$ with six derivators $\mathbf{g}_2 = (g_{xx}, g_{yy}, g_{zz}, g_{xy}, g_{xz}, g_{yz})$. The response 
vector consists of the derivative estimates $(f_{xx}, f_{yy}, f_{zz}, f_{xy}, f_{xz}, f_{yz})$, which also can 
be assembled as a symmetric 3x3 matrix, the Hessian. Typically, the derivators are 
designed as differentiated Gaussians, which calls for reasonable compromises 
between minimum approximation errors and computational efficiency. 

In many applications the next processing step is derotation, which is to separate the 
rotation dependent information from the rotation-invariant part in the response vector. 
The standard procedure is to diagonalize the Hessian to obtain three eigenvalues 
$(\lambda_1, \lambda_2, \lambda_3)$ and their corresponding eigenvectors. After proper normalization the 
eigenvectors comprise the columns of the 3x3-rotator matrix R with three degrees of 
freedom. Let $\mathbf{x} = (x, y, z)^T$. The rotator R should be able to transform the local 
neighborhood $f(\mathbf{x})$ into a unique prototype function $p(R\mathbf{x})$ so that 

$$(f_{xx}, f_{yy}, f_{zz}, f_{xy}, f_{xz}, f_{yz}) \rightarrow (p_{xx}, p_{yy}, p_{zz}),$$

which separates orientation information from magnitude and shape. The orientation 
appears in the rotator R, magnitude and shape in the prototype vector $(p_{xx}, p_{yy}, p_{zz})$. 

It would then be tempting to just identify the three eigenvalues taken in any order 
with the three prototype derivatives $(p_{xx}, p_{yy}, p_{zz})$. Unfortunately, since there are no 
less than six possible permutations (indexing) of the three eigenvalues and the 
eigenvector columns of the rotator R, the uniqueness of such a prototype is not 
secured. In this paper we will show that there is a rule for selecting a permutation to 
secure a truly unique derotation, which at the same time delivers a two-dimensional 
continuous shape space residing on a part of the unit sphere. 
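
The permutation ambiguity can be made concrete with a short NumPy sketch. It is only an illustration under simple assumptions: the six derivative estimates are given as plain numbers, the helper names `hessian_matrix` and `candidate_prototypes` are hypothetical, and `numpy.linalg.eigvalsh` stands in for whatever diagonalization routine is actually used.

```python
import itertools
import numpy as np

def hessian_matrix(fxx, fyy, fzz, fxy, fxz, fyz):
    """Assemble the symmetric 3x3 Hessian from the six derivative estimates."""
    return np.array([[fxx, fxy, fxz],
                     [fxy, fyy, fyz],
                     [fxz, fyz, fzz]], dtype=float)

def candidate_prototypes(H):
    """All orderings of the eigenvalues, i.e. all candidate (pxx, pyy, pzz) triples."""
    lam = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
    return [tuple(lam[list(perm)]) for perm in itertools.permutations(range(3))]

if __name__ == "__main__":
    H = hessian_matrix(1.0, 2.0, 3.0, 0.1, 0.2, 0.3)
    for proto in candidate_prototypes(H):
        print(proto)
```

For a Hessian with three distinct eigenvalues the list contains six different triples, all describing the same local shape. A fixed selection rule is therefore needed before the triple can serve as a unique prototype.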

As mentioned, several authors have been employing second derivatives and the 
Hessian eigenvalues to detect and discriminate for shape. Lorenz et al [1] distinguish 
a bright line (string) when $\lambda_1 \approx \lambda_2 \ll 0$, $|\lambda_3| \approx 0$ and derive several stringness 
factors from the eigenvalues, e.g. $1 - \frac{2|\lambda_3|}{|\lambda_1|+|\lambda_2|}$. Sato et al [2] observe the same string 
detection rules (except for a different ordering among the eigenvalues) and distinguish 
them from sheets (planes) for which $\lambda_3 \ll 0$, $|\lambda_1| \approx |\lambda_2| \approx 0$. Frangi et al [3] suggest 
three shape descriptors based on various combinations of the magnitudes of the 
eigenvalues. Similar shape descriptors are also used by Kindlmann et al [4] to span a 
shape space, which is visualized as a triangle with the three archetypal ellipsoidal 
shapes called blob-like ellipsoid, string-like ellipsoid and plane-like ellipsoid. This 
shape space is essentially the same as the one brought forward many years ago by 
Knutsson [5] although the tensor behind the latter shape space is not the Hessian but a 
set of quadrature filter responses. A special feature of these responses is phase-invariance 
with respect to the locally dominating frequency components. Similarly, 
Basser [6] portrays the eigenvalues of the diffusion matrix with ellipsoids, which may 
change in shape depending on the relative magnitudes between the eigenvalues. 

For tensors like the 3D-diffusion coefficients, where all eigenvalues have the same 
sign, it makes some sense to model the local variation as an ellipsoid. However, the 
Hessian and other tensors, which are not restricted in this way, have a much richer 
variation. This is revealed already in the 2D-case where the two eigenvalues 
$(\lambda_1, \lambda_2)$ are identified as the prototype derivatives $(p_{xx}, p_{yy})$ or $(p_{yy}, p_{xx})$. If these 
quantities are equal in magnitude but have different signs, the local intensity variation 
is shaped as a saddle surface; if the signs are equal it is a positive or negative 2D-blob. 
These two shapes are extremely different, in fact orthogonal. A method or a 
representation that fails to detect, or does not distinguish between, a saddle and a blob 
is incomplete to say the least. 
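
The 2D case is small enough to illustrate in a few lines of Python. The sketch treats the prototype as the point $(p_{xx}, p_{yy})$ in the plane and uses the plain Euclidean inner product; the function name `blob_saddle_components` is hypothetical and the normalization is chosen only for this illustration, not taken from the paper.

```python
import math

def blob_saddle_components(l1, l2):
    """Decompose the eigenvalue pair (l1, l2) along the blob and saddle directions.

    The unit vectors (1, 1)/sqrt(2) (blob) and (1, -1)/sqrt(2) (saddle) are
    orthogonal, so the two components are independent of each other.
    """
    blob = (l1 + l2) / math.sqrt(2.0)
    saddle = (l1 - l2) / math.sqrt(2.0)
    return blob, saddle

if __name__ == "__main__":
    print(blob_saddle_components(1.0, 1.0))    # pure blob: saddle component is zero
    print(blob_saddle_components(1.0, -1.0))   # pure saddle: blob component is zero
```

A model based on eigenvalue magnitudes alone maps both inputs to (1, 1) and therefore cannot tell the two shapes apart, whereas the signed decomposition separates them completely.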



2 The Spherical Harmonics Orthonormal Basis Set 

Let $h_0(r)$ be a rotationally symmetric 3D-function, e.g. a Gaussian. The six second 
degree derivators are then obtained in the signal and the Fourier domains as 

The above derivators $(G_{xx}, G_{yy}, G_{zz}, G_{xy}, G_{xz}, G_{yz})$ in the Fourier domain are 
combined as follows in (4) into orthonormal spherical harmonic operators 
$(C_{20}, \ldots, C_{25})$ with normalizing factors yielding 

Because orthogonality and normalization, as well as harmonic variation in the angular 
direction are preserved over the Fourier transform [7] we have in the signal domain 

Existing in both domains, the polynomial forms in (2) and (5) indicate a strong 
relationship between moment functions and spherical harmonics. The radial variation 
$H_0(\rho)$, common for all six basis functions in the Fourier domain, corresponds over 
two Hankel transforms of different orders with the functions in the signal domain. 
This is because the harmonic variation is of zero order for the Laplacian operator 
$c_{20} \leftrightarrow C_{20}$ and of second order for the five other basis functions. 

The mapping (4) from derivators to orthonormal basis functions performed by the 
6x6 matrix M carries over directly to the derivatives. Therefore, we also obtain the 
total response $\mathbf{f}_2$ and the derotated prototype response $\mathbf{p}_2$ as coefficients for the six 
orthogonal basis functions $(c_{20}, c_{21}, c_{22}, c_{23}, c_{24}, c_{25})$ as follows 

$$\mathbf{f}_2 = (f_{20}, f_{21}, f_{22}, f_{23}, f_{24}, f_{25})^T = M\,(f_{xx}, f_{yy}, f_{zz}, f_{xy}, f_{xz}, f_{yz})^T,$$

Both vectors will now appear in this perfectly orthonormal and complete shape 
space. We notice that out of the six basis functions, $(c_{23}, c_{24}, c_{25})$ are identical to the 
three cross-derivators $(g_{xy}, g_{xz}, g_{yz})$, except for a scale factor. Hence, the above 
diagonalization procedure, which eliminates the responses $(f_{xy}, f_{xz}, f_{yz})$, leaves 
us with prototype responses that we, at least initially, could identify as 
$p_{xx} = \lambda_1$, $p_{yy} = \lambda_2$, $p_{zz} = \lambda_3$ with a randomly ordered set of eigenvalues. This 
preliminary response vector $\mathbf{p}'_2$ is shown in the $(c_{20}, c_{21}, c_{22})$-space in Figure 1. 

Alluding to their shape in signal space indicated by the attached icons in Figure 1, 
the three basis functions are named Blob, Double Cone, and (four-wedge) Orange, 
respectively. The solid lines in these icons represent zero-crossings of the intensity, 
the boundaries between positive and negative density relative to the underlying DC-level 
and more global gradients. The smooth decay of the magnitude in the radial 
direction due to the functions $h_{20}(r)$ and $h_2(r)$ is indicated by a dotted contour. 



3 The Non-redundant Shape Space 

The $(c_{20}, c_{21}, c_{22})$-space in Figure 1 is redundant. Certain shapes repeat themselves 
along the equator of Figure 1, the subspace spanned by $(c_{21}, c_{22})$. Using the 
polynomial expressions (5), in (8), (9) we obtain two alternative sets of basis 
functions by rotations around the $c_{20}$-axis with 120° and 240°, respectively. The 
polynomial expressions for both these sets are identical to the ones for $(c_{21}, c_{22})$ 
except for cyclic permutations of the signal space coordinates. The shapes they 
represent in signal space are therefore identical except for rotation of the coordinate 
system (x, y, z) around the vector (1, 1, 1) (the cube diagonal) with 120° and 240°, 
respectively. 



Fig. 1. The redundant shape space $(c_{20}, c_{21}, c_{22})$ 

We notice that the shapes of $c_{22}$ and $-c_{22}$ are identical except for rotation around 
the z-axis with +90° or -90° in the signal space, equivalent to a shape-invariant 
flipping of the equatorial plane around the $c_{21}$-axis in the $(c_{20}, c_{21}, c_{22})$-space. We 
simply swap the roles of x and y in the polynomials. For symmetry reasons there 
are three possibilities for such a pair-wise swap. All the six possibilities to perform a 
mapping between the eigenvalues and the prototype derivatives are then accounted 
for. Only one sixth of the 360° circumference of the equator in Figure 1 is needed to 
describe all shapes, provided we make certain that all prototype responses end up in 
the same sector. 

In Figure 1, let us select the 60°-sector ±30° around the unit vector $c_{22}$ as the 
“chosen one”, knowing that any of the five other 60°-sectors would be just as fine. 
The chosen sector of the non-redundant shape space is portrayed in Figure 2. The 
shape angle $\eta = \arg(c_{21}, c_{22})$ is then restricted by 

$$\frac{\pi}{3} \le \eta \le \frac{2\pi}{3}. \qquad (10)$$

We use the polynomial expressions in (5) and (7) in (10) to obtain 

Hence, we should make the eigenvalue/prototype assignment as 
Pxx^K^ Pyy^h^ PTZ^'^^^ where 



(12) 



(13) 
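
The effect of the sector restriction can be sketched numerically. The equations referenced as (5), (7) and (11) to (13) are not reproduced in this text, so the sketch rests on explicit assumptions: that $p_{21}$ is proportional to $\sqrt{5/24}\,(2p_{zz} - p_{xx} - p_{yy})$ and $p_{22}$ to $\sqrt{5/8}\,(p_{xx} - p_{yy})$, and that the assignment places the largest eigenvalue in $p_{xx}$, the smallest in $p_{yy}$ and the middle one in $p_{zz}$ (consistent with the later remark that the smallest eigenvalue is $p_{yy}$ and the largest is $p_{xx}$). The function names are hypothetical.

```python
import math

def assign_prototype(eigenvalues):
    """Assumed rule: pxx = largest, pyy = smallest, pzz = middle eigenvalue."""
    lo, mid, hi = sorted(eigenvalues)
    return hi, lo, mid  # (pxx, pyy, pzz)

def shape_angle_eta(pxx, pyy, pzz):
    """Shape angle eta = arg(p21, p22) under the assumed normalization."""
    p21 = math.sqrt(5.0 / 24.0) * (2.0 * pzz - pxx - pyy)
    p22 = math.sqrt(5.0 / 8.0) * (pxx - pyy)
    return math.atan2(p22, p21)

if __name__ == "__main__":
    for lam in [(3.0, 1.0, -2.0), (0.0, 5.0, 5.0), (-1.0, -1.0, 4.0)]:
        eta = shape_angle_eta(*assign_prototype(lam))
        print(lam, round(math.degrees(eta), 2))
```

Under these assumptions every non-degenerate eigenvalue triple lands in the sector between 60° and 120°, and axially symmetric triples (two equal eigenvalues) fall exactly on its two edges.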



In Figure 2 we find the axially symmetric shapes (strings, planes) along the 
boundaries of the 60° wedge. It can be shown [8] that the best match for ideal strings 
and planes are obtained with the prototypes called optimal string and optimal plane, 
which deviate significantly from the ideal shapes. Along the meridian through $c_{22}$ we 
find the orange but also some structures that we call ovals which are located 
approximately half-way between the blobs at the poles and the orange at the equator. 

In the introduction of this paper we claimed that the ellipsoidal model makes sense 
only when all eigenvalues (all prototype derivatives) have the same sign. In Figure 2 
we find those parts of the shape space at the top and bottom, respectively. The upper 
“ellipsoidal” area is limited by a curve on which the smallest eigenvalue, $p_{yy}$, is zero, 
the lower one by a curve where the largest eigenvalue, $p_{xx}$, is zero. 
This is in accordance with the fact that the icons, the normalized prototypes, located on 
these curves also lack contributions from $g_{yy}$ and $g_{xx}$, respectively. 

Fig. 2. The non-redundant shape space 



4 Rotation Invariants. Shape Factors 

Whatever pose we give a certain prototype, the derotation procedure described above 
will return the same coefficients $(p_{20}, p_{21}, p_{22})$ in the $(c_{20}, c_{21}, c_{22})$-space. Hence, 
any function using the arguments $(p_{20}, p_{21}, p_{22})$ or the derivatives $(p_{xx}, p_{yy}, p_{zz})$ 
is rotation invariant. One such function that has been proposed by some authors as 
an energy measure is 

$$p_a^2 = \lambda_1^2 + \lambda_2^2 + \lambda_3^2 = p_{xx}^2 + p_{yy}^2 + p_{zz}^2 = 2p_{20}^2 + \tfrac{4}{5}\left(p_{21}^2 + p_{22}^2\right), \qquad (14)$$

where we have derived the last part of this expression from the inverse of (6). $p_a^2$ is 
rotation invariant but not shape invariant, which by all likelihood was the intention. It 
over-emphasizes the $p_{20}$-component and under-emphasizes the $(p_{21}, p_{22})$-components. 
Hence, for a certain given signal energy it gives a higher value for blob-like 
shapes, a lower value for shapes which appear at the equator in Figure 2. Let 
$p_2 = \|\mathbf{p}_2\|$, $f_2 = \|\mathbf{f}_2\|$. The only energy measure indifferent to both rotation and shape 
is the sum of the energies of the orthogonal components, which yields 


$$p_2^2 = p_{20}^2 + p_{21}^2 + p_{22}^2 = p_{xx}^2 + p_{yy}^2 + p_{zz}^2 - \tfrac{1}{2}\left(p_{xx}p_{yy} + p_{xx}p_{zz} + p_{yy}p_{zz}\right)$$

$$f_2^2 = f_{20}^2 + f_{21}^2 + f_{22}^2 + f_{23}^2 + f_{24}^2 + f_{25}^2 = f_{xx}^2 + f_{yy}^2 + f_{zz}^2 - \tfrac{1}{2}\left(f_{xx}f_{yy} + f_{xx}f_{zz} + f_{yy}f_{zz}\right) + \tfrac{5}{2}\left(f_{xy}^2 + f_{xz}^2 + f_{yz}^2\right)$$

and can be computed directly from the derivatives without derotation and serve as the 
first discrimination level for voxels-of-interest. In general, all functions with 
arguments $(p_2, p_{20})$ or $(p_{20}^2,\ p_2^2 - p_{20}^2) = (p_{20}^2,\ p_{21}^2 + p_{22}^2)$ can be computed without 
derotation, which also includes the function $p_a^2$ just mentioned. To be able to 
discriminate for shape we may use the shape angle $\kappa$ in Figure 1, which is defined as 

$$\kappa = \arcsin\frac{p_{20}}{p_2} = \arcsin\frac{f_{20}}{f_2}$$
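
That the energy can be obtained without derotation is easy to verify numerically. The sketch below is hedged in two ways: the coefficients 1/2 and 5/2 are read from the energy expressions above, and the Laplacian coefficient is assumed to be $f_{20} = (f_{xx}+f_{yy}+f_{zz})/\sqrt{6}$, which is the normalization consistent with those expressions but is not stated explicitly in this text. The function names are hypothetical.

```python
import numpy as np

def energy_from_derivatives(fxx, fyy, fzz, fxy, fxz, fyz):
    """Rotation- and shape-indifferent energy f2^2 computed without derotation."""
    return (fxx**2 + fyy**2 + fzz**2
            - 0.5 * (fxx * fyy + fxx * fzz + fyy * fzz)
            + 2.5 * (fxy**2 + fxz**2 + fyz**2))

def energy_from_eigenvalues(lam):
    """The same energy, p2^2, computed from the derotated prototype (eigenvalues)."""
    l1, l2, l3 = lam
    return l1**2 + l2**2 + l3**2 - 0.5 * (l1 * l2 + l1 * l3 + l2 * l3)

def shape_angle_kappa(fxx, fyy, fzz, fxy, fxz, fyz):
    """Shape angle kappa = arcsin(f20 / f2), with f20 assumed to be trace / sqrt(6)."""
    f20 = (fxx + fyy + fzz) / np.sqrt(6.0)
    f2 = np.sqrt(energy_from_derivatives(fxx, fyy, fzz, fxy, fxz, fyz))
    return float(np.arcsin(np.clip(f20 / f2, -1.0, 1.0)))

if __name__ == "__main__":
    d = (1.0, 2.0, 3.0, 0.1, 0.2, 0.3)
    H = np.array([[d[0], d[3], d[4]], [d[3], d[1], d[5]], [d[4], d[5], d[2]]])
    print(energy_from_derivatives(*d), energy_from_eigenvalues(np.linalg.eigvalsh(H)))
```

The two printed numbers agree, since both expressions reduce to (5/4) times the sum of squared eigenvalues minus (1/4) times the squared trace, and both of those quantities are rotation invariant.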



A rather general discriminative and energy independent shape factor can be 
obtained, still without computing eigenvalues/eigenvectors, as 




which is zero both for blobs, where $p_2 = |p_{20}|$, $\kappa = \pm\frac{\pi}{2}$, and for purely Laplacian-free 
shapes where $p_{20} = 0$, $\kappa = 0$. The shape factor takes maximal and minimal 
values +1 and -1 for $\kappa = \pm\frac{\pi}{4}$, respectively, i.e. for shapes that are in between the 
poles and the equator in Figure 2. This is, approximately, where we find second-degree 
variations of type planes and strings to be dealt with below. 

The angle $\eta$ in Figure 2 is also shape specific. All shapes at the two edges of the 
shape space are rotationally symmetric around the x-axis and the y-axis, respectively. 
In general for axially symmetric prototypes we have 

$$\sqrt{3}\,|p_{21}| = p_{22} \;\Leftrightarrow\; \sin\eta = \frac{\sqrt{3}}{2}. \qquad (18)$$



It is interesting to note that the inner loop of the Jacobi method for diagonalization 
[PressSS] should terminate rather quickly for axially symmetric and nearly axially 
symmetric shapes. The details of this argument are left out here for the sake of 
brevity. However, lack of fast convergence indicates non-axial shape symmetry, 
which could be set to trigger early termination if such shapes are considered 
uninteresting. We have found that the successive discrimination of voxels-of-interest by 
energy, by shape (e.g. with the shape factor defined above), and by early termination of 
the eigenvalue computation brings about dramatic savings in computation time [8]. 

Assuming that we have access to the complete prototype vector we propose the 
following shape factor for axial symmetry: 

$$\frac{2\sqrt{3}\,p_{21}\,p_{22}}{3p_{21}^2 + p_{22}^2},$$

which is +1 and -1 for shapes with a perfect symmetry axis along y and x 
respectively, zero for shapes along the $c_{22}$-meridian in Figure 2. However, it is 
important to be able to distinguish between strings and planes, which is indicated by 
$\mathrm{sgn}(p_{20}\,p_{21})$. We may use the product 
(20) 



which takes the sign from the factor $p_{20}\,p_{21}$, since all other factors including $p_{22}$ are 
positive. But from the properties of the two shape factors it also follows 
that we obtain the maximum value +1 for all strings and the minimum value -1 for 
all planes independent of polarity. It also follows that the zero values are found on the 
equator and along the $c_{22}$-meridian. Hereby, the ovals, halfway between planes and 
strings, will also get the shape factor value zero. 



5 Conclusions 

This paper advocates that the second derivative response vector should be mapped 
onto an orthogonal space, with the help of spherical harmonics. Without an 
orthogonal space many misunderstandings tend to prevail, e.g. that the second-degree
derivators (g_xx, g_yy, g_zz) are mutually orthogonal just as the three gradient detectors
(g_x, g_y, g_z). Rotation invariance and shape are two sides of the same coin. One
cannot be understood without the other. Neither of them can be fully grasped without
portraying the complete and non-redundant shape space, the variation that is left after 
the object under study is derotated to match its prototype. We have presented here a 
method, which for the first time is able to reduce the six-dimensional measurement 
space into an orthogonal non-redundant three-dimensional space for shape and 
magnitude. All response vectors and functions thereof are rotation invariant in this 
space. In summary, the steps of this generic and open-ended method are as follows. 

T 

1. Convolve the object function f(x, y, z) with the derivators Q 2 to retrieve H y .

2. Compute the eigenvalues and the eigenvectors of the Hessian H y .

3. Map the eigenvalues onto the reduced prototype space (c20, c21, c22).

4. Compute orientation angles, magnitude, shape factors and other discriminators 

A more complete mathematical background to the subject of this paper, as well as 
implementations and experiments in image enhancement and segmentation using 
second derivatives, are found in [8]. 
Abstract. The skeleton is an effective tool for shape analysis if its structure can
be regarded as a faithful stick-like representation of the pattern. However, 
contour noise may affect this structure by originating spurious skeleton 
branches, so that skeletonization algorithms should include a pruning phase 
devoted to an analysis of the peripheral skeleton branches and, possibly, to their 
partial or total removal. In this paper, labeled skeletons are considered and the 
significance of a peripheral branch is evaluated by analyzing the type of 
interaction between the pattern subset corresponding to the peripheral branch 
and the pattern subsets corresponding to the skeleton branches adjacent to the 
peripheral branch. The proposed criteria for skeleton pruning are expressed in 
terms of four parameters, which as a whole describe the role that the pattern 
subset corresponding to the peripheral branch plays in the characterization of 
the shape of the pattern. 



1 Introduction 

The skeleton is a useful representation to consider when, in the framework of a 
structural description task, a decomposition into parts of a planar pattern is desired 
[1]. The parts are obtained in correspondence with the elements of a partition of the
skeleton, and are perceptually significant to the extent that the structure of the
skeleton is a faithful stick-like representation of the pattern. For this reason, the
skeleton should not possess branches in correspondence with details of the contour
which are regarded either as due to some kind of noise, e.g., digitization noise
occurring under pattern rotation, or as not relevant in the specific problem domain.
Since such branches generally exist, any skeletonization algorithm is required to
include a pruning phase devoted to an analysis of the peripheral skeleton branches
and, possibly, to a partial or total removal of these branches [2]. As a result, a
significant and manageable skeleton should be obtained.

Pruning is more effective when applied to labeled skeletons, i.e., with pixels 
labeled with their distance from the complement of the pattern. In fact, distance 
information allows one to identify the role of every skeletal pixel in the representation 
of the pattern, in terms of the area contribution to the shape of the pattern given by the
disc associated with that pixel.

In this paper, the evaluation of the significance of a peripheral skeleton branch is 
done by analyzing in detail the relations between the tip of the branch (the end point) 
and a reference pixel in the branch, in terms of their associated discs, as well as the 
type of interaction between the pattern subset corresponding to the peripheral branch 
and the pattern subsets corresponding to the skeleton branches adjacent to the 
peripheral branch. The analysis is carried out in the context of steady subsets of the 
skeleton, the presence of which is assumed to account for the significant pattern 
subsets characterizing any shape. These pattern subsets are regions with almost 
constant thickness and regions with thickness larger or smaller than the thickness of 
their adjacent regions. Only the parts of the skeleton which are not steady subsets can 
be removed, if suitable criteria are verified. The criteria are expressed in terms of four 
parameters, which as a whole describe the role that the pattern subset corresponding 
to the peripheral branch plays in the characterization of the shape of the pattern.

In the next section, notations and preliminary notions are introduced, and in section 
3 the four parameters are described. In section 4, the pruning method is outlined, 
while pointing out the two phases characterizing the procedure. Finally, in section 5,
some examples are shown and the performance of the proposed algorithm is
compared with that of two well-known jut-based pruning algorithms.



2 Preliminaries 

We refer to a binary picture, rid of salt-and-pepper noise, where F={1} and B={0} are
the foreground and the background, respectively. Without loss of generality, we
suppose that F is a connected set, regardless of its connectivity order, and does not
touch the frame of the picture. The 8-metric and the 4-metric are respectively used for
F and B.

For any pair of pixels p and q in the picture, a path from p to q is a finite sequence
of pixels starting from p and terminating in q, where each pixel belongs to the
8-neighborhood of the preceding pixel. The (d1,d2)-weighted distance from p to q
(with d2 < 2d1) is the length of a shortest path from p to q, where the distance between
two successive pixels in the path is either equal to d1 if the pixels are
horizontally/vertically adjacent, or equal to d2 if the pixels are diagonally adjacent
[3]. In this paper the (3,4)-weighted distance is taken into account, and d(p,q) will be
used to denote the distance between p and q.
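
As an illustration of the definition above, the following minimal sketch computes the weighted distance between two pixels on an obstacle-free grid. It assumes d1 <= d2 < 2*d1 (true for the (3,4) weights), in which case a shortest path takes as many diagonal moves as possible and the length has a closed form. The function name `weighted_distance` and the (row, column) tuple convention are illustrative choices, not taken from the paper.

```python
def weighted_distance(p, q, d1=3, d2=4):
    """(d1,d2)-weighted distance between pixels p and q, given as (row, col).

    Assumes d1 <= d2 < 2*d1 and no obstacles between the two pixels.
    """
    dr = abs(p[0] - q[0])
    dc = abs(p[1] - q[1])
    diagonal_moves = min(dr, dc)                 # each costs d2
    straight_moves = max(dr, dc) - diagonal_moves  # each costs d1
    return d2 * diagonal_moves + d1 * straight_moves


# One diagonal move and two horizontal moves: 4 + 3 + 3
example = weighted_distance((0, 0), (1, 3))
```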

For every p in F, the disc Dp centered on p is the set of pixels having distance from 
p not greater than the distance of p from B. A pixel p is the center of a maximal disc 
(cmd) if it does not belong to a shortest path between any of its neighbors and B. 

The distance transformation of F with respect to B is the process creating a multi- 
valued set, replica of F, which differs from F in having each pixel labeled with its 
distance from B. 
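
A minimal sketch of such a distance transformation is given below. It uses the classical two-scan chamfer propagation, which is one common way to obtain weighted distance labels; the paper does not state which implementation it uses, so this is an assumption. The picture is a list of rows with 1 for F and 0 for B, and the name `distance_transform` is illustrative.

```python
def distance_transform(picture, d1=3, d2=4):
    """Label each foreground pixel (1) with its (d1,d2)-weighted distance
    from the background (0), using one forward and one backward raster scan."""
    rows, cols = len(picture), len(picture[0])
    inf = float("inf")
    dt = [[inf if picture[r][c] else 0 for c in range(cols)] for r in range(rows)]

    def relax(r, c, offsets):
        # Lower the label of (r, c) using already visited neighbours.
        for dr, dc, weight in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                dt[r][c] = min(dt[r][c], dt[rr][cc] + weight)

    forward = [(-1, -1, d2), (-1, 0, d1), (-1, 1, d2), (0, -1, d1)]
    backward = [(1, 1, d2), (1, 0, d1), (1, -1, d2), (0, 1, d1)]

    for r in range(rows):
        for c in range(cols):
            if dt[r][c]:
                relax(r, c, forward)
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if dt[r][c]:
                relax(r, c, backward)
    return dt


# A 3x3 square of foreground surrounded by background.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
labels = distance_transform(square)
```

In this small example the eight border pixels of the square receive label 3 and the central pixel receives label 6.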
For a pixel p labeled k, the reverse distance transformation is the process building
the open disc with radius ⌈k/d1⌉ centered on p, and assigning to any of its pixels, say r,
the label (k-d(p,r)).

The skeletonization of F is a process leading to the extraction of a linear subset, the 
skeleton, which is spatially placed along the central regions of F. The skeleton is a 
stick-like representation of the pattern and, depending on the problem domain, is 
required to account for different shape properties, such as symmetry, elongation, 
width, and contour curvature. 

The skeletonization algorithm [4] we consider is modeled on the grassfire 
transformation [5] and detects the skeleton in correspondence with the regions of F 
where different fire fronts interact with each other. The resulting skeleton is a union of
simple digital arcs and curves, is centrally located within F and has the same topology 
as F. Its pixels (skeletal pixels) are either centers of maximal discs or pixels 
connecting pairs of such centers, and are labeled with their (3,4)-weighted distance 
from B. Note that, in the following, the letters used to denote skeletal pixels should be 
understood as denoting also their associated distance label. 

For any skeletal pixel p, let Np denote the set of skeletal pixels in the
8-neighborhood of p and let np be its cardinality; then p is called an end point if np=1,
a normal point if np=2 and a branch point if np>2. A sequence of normal points having as
extremes an end point and a branch point is called a peripheral skeleton branch.

A skeletal pixel p is called local maximum if all its neighbors in Np, except one, 
have label not greater than p and the remaining neighbor has label less than p. 

A skeletal pixel p is called local minimum if all its neighbors in Np, except one, 
have label not smaller than p and the remaining neighbor has label greater than p. 

Skeletal pixels which are local maxima, local minima, or are pixels with at least 
one neighbor with the same label are called critical skeletal pixels. 
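
The definitions above can be sketched in code as follows. The skeleton is represented as a dictionary mapping (row, col) to the distance label, which is an illustrative choice. The local maximum and minimum tests read the definitions as "no neighbor exceeds (falls below) p and at least one neighbor is strictly smaller (larger)"; this reading is an assumption. All function names are hypothetical.

```python
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]


def skeletal_neighbours(skeleton, pixel):
    """Return Np: the skeletal pixels in the 8-neighbourhood of pixel."""
    r, c = pixel
    return [(r + dr, c + dc) for dr, dc in NEIGHBOURS_8
            if (r + dr, c + dc) in skeleton]


def point_type(skeleton, pixel):
    """Classify a skeletal pixel by the cardinality np of Np."""
    n = len(skeletal_neighbours(skeleton, pixel))
    if n == 1:
        return "end"
    if n == 2:
        return "normal"
    if n > 2:
        return "branch"
    return "isolated"  # np = 0 is not covered by the text


def is_critical(skeleton, pixel):
    """True for local maxima, local minima, and pixels having at least
    one skeletal neighbour with the same label."""
    label = skeleton[pixel]
    labels = [skeleton[n] for n in skeletal_neighbours(skeleton, pixel)]
    if not labels:
        return False
    local_max = all(l <= label for l in labels) and any(l < label for l in labels)
    local_min = all(l >= label for l in labels) and any(l > label for l in labels)
    same_label = any(l == label for l in labels)
    return local_max or local_min or same_label


# A short horizontal skeleton with labels 3, 6, 9, 9, 7.
skeleton = {(0, 0): 3, (0, 1): 6, (0, 2): 9, (0, 3): 9, (0, 4): 7}
types = [point_type(skeleton, (0, c)) for c in range(5)]
critical = [is_critical(skeleton, (0, c)) for c in range(5)]
```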

The steady skeleton subsets are (unions of) maximal sequences of critical skeletal
pixels. Only maximal sequences constituted by at least three critical pixels, or by at
most two pixels which are either both local maxima or both local minima, are taken
into account. Two sequences are merged if they are separated by a set, constituted by
at most three consecutive pixels, which is delimited by extremes of the two sequences
whose distance labels differ by no more than the amount d1+d2.

For a peripheral skeleton branch traced from the end point, which is assumed not to
belong to a steady subset, the hinge pixel is either the first pixel belonging to
the first steady subset met or the branch point, if no steady subsets are found in the
branch. The sequence delimited by the end point and by the predecessor of the hinge
pixel is the part of the branch to be analyzed and possibly pruned. This sequence is
constituted by pixels with labels which are non-decreasing, starting from the end
point, and will be denoted by A.



3 Pruning Parameters 

In a peripheral skeleton branch, let p and q respectively denote the end point and the 
hinge pixel, and let Dp and Dq be the discs created in correspondence of them by the 
reverse distance transformation. 
Moreover, if Dp and Dq partially overlap each other, let Hp denote the subset of Dp
not overlapped by Dq. Hp is the pattern subset which could not be represented by the
skeleton if the subset A were removed. Since labels are non-decreasing in A and p
is a cmd not belonging to a steady subset, it follows that p<q and that Dp is smaller
than Dq. The label of p and the spatial position of p with respect to q
determine the shape and the size of Hp.

Four parameters are used to characterize Hp: they are IN, TH, RP, and SM. 
Meaning and definitions of the parameters are as follows. 

• IN is used to express the size of Hp with respect to Dp.

    IN = q - d(p,q).

When IN>d2, p belongs to Dq and has no neighbors in the complement of Dq. Hp is
at most 40% of Dp. See Figs. 1a and 1b.

When 0<IN≤d2, p belongs to Dq and has at least one of its eight neighbors
belonging to the complement of Dq. Hp is at most 60% of Dp. See Fig. 1c.

When IN=0, p belongs to the complement of Dq and is adjacent to Dq. The size of
Hp is at most 65% of Dq. See Fig. 1d.

When IN<0, p belongs to the complement of Dq and overlapping between Dp and
Dq is not guaranteed. When overlapping occurs, Hp is at least 73% of Dp. See Fig. 1e.

• TH is defined for IN≥0, and expresses the weighted distance from the background
of the innermost pixel in Hp.

    TH = p          if IN = 0
    TH = p - d1     if 0 < IN < d1
    TH = p - 2d1    if IN = 2d1 - 1
    TH = p - IN     if IN > d1 and IN ≠ 2d1 - 1

When TH ≤ d2, all the pixels of Hp are adjacent to the background, see Fig. 1a. As
for Figs. 1b-1e, there results TH > d2.

• RP is defined for IN≥0 and gives the number of pixels belonging both to Hp and to
the longest among the horizontal/vertical paths from the background to the
innermost pixels of Hp.

    RP = int(p/d1)                  if IN = 0
    RP = 1 + maxn(p) - maxn(IN)     otherwise

    with maxn(r) = int(r/d1) + 1    if mod(r,d1) > 0
         maxn(r) = int(r/d1)        otherwise

In Figs. 1a, 1b, 1c and 1d, RP is equal to 2, 3, 4 and 4, respectively.
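
The three parameters introduced so far can be checked against the values quoted for Fig. 1 (q=15, p=12, d1=3) with a short sketch. The function names are illustrative. One assumption is made explicit: the case IN=d1 is not listed in the definition of TH, and it is treated here as TH = p - d1, which coincides with p - IN for that value.

```python
def maxn(r, d1=3):
    """maxn(r) = int(r/d1) + 1 if mod(r, d1) > 0, otherwise int(r/d1)."""
    return r // d1 + (1 if r % d1 > 0 else 0)


def pruning_parameters(p, q, dist_pq, d1=3):
    """Return (IN, TH, RP) for end point label p, hinge pixel label q and
    weighted distance dist_pq = d(p, q). TH and RP are None when IN < 0."""
    IN = q - dist_pq
    if IN < 0:
        return IN, None, None

    if IN == 0:
        TH = p
    elif IN <= d1:            # IN = d1 handled here (assumption, see lead-in)
        TH = p - d1
    elif IN == 2 * d1 - 1:
        TH = p - 2 * d1
    else:
        TH = p - IN

    if IN == 0:
        RP = p // d1
    else:
        RP = 1 + maxn(p, d1) - maxn(IN, d1)
    return IN, TH, RP


# Fig. 1 uses q = 15 and p = 12 with IN = 8, 5, 1, 0, -1,
# i.e. d(p, q) = 7, 10, 14, 15, 16.
fig1 = [pruning_parameters(12, 15, d) for d in (7, 10, 14, 15, 16)]
```

For these inputs the sketch yields RP = 2, 3, 4 and 4 for the first four cases, in agreement with the values stated in the text.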

• SM represents the degree of smoothness of the contour arc of the region in
correspondence with the terminal part of A, and is computed when A is constituted
by at least three pixels p1, p2, p3, with p1=p. For this contour arc, we define four
degrees of smoothness, from 0 to 3, depending on the relations between the labels
associated with p1, p2 and p3. The smoother the contour arc, the greater SM, see
Fig. 2.
Fig. 1. Different size of Hp, when IN varies. The examples are shown for q=15 and p=12. A
black dot denotes p; gray squares denote pixels belonging both to Dq and to Dp; Hp is the set of
circled pixels. a) IN=8; b) IN=5; c) IN=1; d) IN=0; e) IN=-1.

Fig. 2. SM represents the degree of smoothness of the contour arc (denoted by gray squares) of
the region in correspondence with the labeled skeletal pixels. There results: a) SM=0; b) SM=1;
c) SM=2; d) SM=3.



4 Pruning Process

Two phases are foreseen for pruning the skeleton subset A. In the first phase, Hp is 
estimated by referring only to the mutual positions of Dp and Dq. Three outputs are 
returned from this analysis: 1) retain A, 2) remove A, 3) more information about the 
context is necessary to take a decision. The second phase is accomplished only in the 
third case, which occurs when the hinge pixel is a branch point. This phase is 
concerned with the analysis of the interaction between A and the skeleton subsets 
adjacent to the branch point, and takes into account the parts of the pattern in 
correspondence with those branches, with special reference to their thickness. 



4.1 First Phase 

For IN>0, A is retained when the interaction between Dp and Dq is weak, the thickness 
of Hp is not negligible and the contour arc delimiting Hp has high curvature. In detail, 
A is retained if at least one of the following conditions holds: 

a) IN<r/, and TH>^f,; b) SM<3 and RP>SM and RP>1 and {m<d, or TH>J,) 

If IN>0 and the previous conditions are not verified, A is removed if the interaction 
between Dp and Dq is very high or if the thickness of Hp is negligible with respect to 
the size of Dp or if the region of the pattern associated with A is a negligible 
protrusion of the region associated with the steady subset including q. In detail, A is 
removed if at least one of the conditions following holds: 

a’) IN>(i 2 and TH^r/^; b’) TH^if^ and p<d ^ ; c’) ^ is not a branch point. 

Note that if q is not a branch point, then it is necessarily a pixel of a steady subset 
and the fact that conditions a), b) ,c), d) do not hold implies that Hp is negligible. 

The remaining cases occurring when IN>0 and q is a branch point are considered
in the second phase.

For IN<0, A is retained if at least one of the following conditions holds: 
a”) IN<-^f 2 ; b”) p>d ^ ; c”) SM<3 

In fact, in this case the interaction between Dp and Dq is either very low or does not
occur. On the contrary, if none of the previous conditions holds, A is pruned if q is not
a branch point, while any decision is postponed to the second phase when q is a
branch point.



4.2 Second Phase 

We consider the interaction between A and the skeleton subsets branching off the 
hinge pixel q and adjacent to it. We distinguish two types of skeleton subsets, say C 
and S, and denote by NC and NS the number of times these subsets are present. 




Spatial Relations among Pattern Subsets as a Guide for Skeleton Pruning 






A subset C is constituted by pixels with labels which are non decreasing, starting 
from q, and no pixel of C is a steady pixel. See Figs. 3a, 3b, 3c. 

A subset S is a steady subset and is constituted by pixels with the same label, not less
than q. See Figs. 3b, 3d and 3e.




Fig. 3. Thick lines and dots represent steady subsets.

If NC>1, see Fig. 3a, A is retained.

If NC≤1, see Figs. 3b, 3c, 3d, 3e, 3f, A is pruned if Hp is negligible, i.e., if at least
one of the following conditions holds:

a) p<d, and SM=3; b) lN=d, and RP=1. 

On the contrary, A is retained if condition a) does not hold and IN<0. 

In the remaining cases occurring when NC≤1, the interaction between A and the
other skeleton subsets adjacent to q is considered.

If NS>1, see Fig. 3d, or if NS=1 and NC=1, see Fig. 3b, A is generally retained, but 
it is removed when Hp is considered a negligible protrusion of the pattern subset 
associated both with q and with the skeleton subsets branching off q, i.e., if at least 
one of the following conditions holds: 

a’) m>d, and SM=3; b’) IN>1 and SM=3 and RP=1. 

If NS≤1 and NC=0, see Figs. 3e and 3f, A is removed.

If NS=0 and NC=1, see Fig. 3c, A is removed only if IN>d2 and RP<SM.
5 Discussion

A pruning procedure implementing the previous criteria has been developed and
tested on a data set of more than one hundred patterns, each of which has also been
considered in a version rotated by 30 degrees. The results have been compared with
those obtained by two well-known jut-based pruning algorithms (refer to [6],
pp. 132-133), for simplicity denoted here by AP1 and AP2. For both algorithms, a
peripheral skeleton branch, delimited by an end point p, is traced from p to a pixel q,
which is the center of a maximal disc and is the first pixel met that satisfies a
given condition. If such a pixel q exists, the branch is shortened up to q, q excluded;
otherwise the branch is removed, the branch point excluded.

Algorithm AP1 takes into account the sharpness of a protrusion, and in this paper q
is the first pixel of a sequence of three pixels which are all centers of maximal discs.
Pixel q may coincide with pixel p. As for AP2, q is such that X>d2, where
X=p-q+d(p,q). The value X denotes the amount by which the protrusion is shortened when
the branch is trimmed down to q.
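
The AP2 criterion as stated above can be sketched as a scan along the branch. This is only an illustration of the condition X>d2 with X=p-q+d(p,q); it is not the implementation from [6]. The branch representation (a list of `(row, col, label, is_cmd)` tuples starting at the end point), the helper `weighted_distance`, and the function name `ap2_cut_index` are all assumptions made for this example.

```python
def weighted_distance(a, b, d1=3, d2=4):
    """Obstacle-free (d1,d2)-weighted distance between pixels a and b."""
    dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
    return d2 * min(dr, dc) + d1 * abs(dr - dc)


def ap2_cut_index(branch, d1=3, d2=4):
    """Return the index of the first centre of a maximal disc q on the
    branch with X = p - q + d(p, q) > d2, or None if no such pixel exists
    (in which case the whole branch would be removed)."""
    p_row, p_col, p_label, _ = branch[0]
    for index, (row, col, label, is_cmd) in enumerate(branch):
        if not is_cmd:
            continue
        x = p_label - label + weighted_distance((p_row, p_col), (row, col), d1, d2)
        if x > d2:
            return index
    return None


# Horizontal branch traced from the end point, labels 3, 6, 7, 7.
branch = [(0, 0, 3, True), (0, 1, 6, True), (0, 2, 7, True), (0, 3, 7, True)]
cut = ap2_cut_index(branch)
```

In this example X takes the values 0, 0, 2 and 5 along the branch, so the scan stops at the fourth pixel.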

With reference to Fig. 4, we show in column a) the input patterns, together with
their skeletons before pruning, as obtained by the algorithm described in [4]. The
skeletons pruned according to the proposed method are shown in column b), while the
skeletons pruned according to AP1 and AP2 are respectively shown in columns c) and
d). As a general remark, we may note that the proposed pruning algorithm shows a
better performance as regards both stability under pattern rotation and preservation of
significant peripheral skeleton branches.











Fig. 4. a) Input; b) skeleton pruned by the proposed algorithm; c) skeleton pruned by algorithm
AP1; d) skeleton pruned by algorithm AP2.
Abstract. The focus of our paper is on the fitting of general curves
and surfaces to 3D data. In the past researchers have used approximate
distance functions rather than the Euclidean distance because of
computational efficiency. We now feel that machine speeds are sufficient to
ask whether it is worth considering Euclidean fitting again. Experiments
with the real Euclidean distance show the limitations of suggested
approximations like the Algebraic distance or Taubin’s approximation. In
this paper we present our results improving the known fitting methods
by an (iterative) estimation of the real Euclidean distance. The
performance of our method is compared with several methods proposed in the
literature and we show that the Euclidean fitting guarantees a better
accuracy with an acceptable computational cost.



1 Motivation 

One fundamental problem in building a recognition and positioning system based
on implicit 3D curves and surfaces is how to fit these curves and surfaces to 3D
data. This process will be necessary for automatically constructing CAD or other
object models from range or intensity data and for building intermediate
representations from observations during recognition. Of great importance is the
ability to represent 2D and 3D data or objects in a compact form. Implicit
polynomial curves and surfaces are very useful representations. Their power lies
in their ability to smooth noisy data and to interpolate through sparse or missing
data, in their compactness, and in their form being commonly used in numerous
constructions. Let $f_2(x)$ be an implicit polynomial of degree 2 given by

$$f_2(x) = p_0 + x^T p_1 + x^T P_2\, x = 0, \quad x \in \mathbb{R}^2 \text{ or } x \in \mathbb{R}^3. \tag{1}$$

Then, we only have to determine the set of parameters which describes the data
best. The parameter estimation problem is usually formulated as an optimization
problem. A given estimation problem can therefore be solved in many ways
because of different optimization criteria and several possible parameterizations.
Generally, the literature on fitting can be divided into two general techniques:
clustering and least-squares fitting. While the clustering
methods are based on mapping data points to the parameter space, such as the
Hough transform and the accumulation methods, the least-squares methods are
centered on finding the sets of parameters that minimize some distance measures
between the data points and the curve or surface. Unfortunately, the minimization
of the Euclidean distances from the data points to a general curve or surface
has been computationally impractical, because there is no closed form expression
for the Euclidean distance from a point to a general algebraic curve or surface,
and iterative methods are required to compute it. Thus, the Euclidean distance
has been approximated. Often, the result of evaluating the characteristic
polynomial $f_2(x)$ is taken, or the first order approximation suggested by Taubin
is used. However, experiments with the Euclidean distance show the limitations
of approximations regarding quality and accuracy of the fitting results.

The quality of the fitting results has a substantive impact on the recognition
performance, especially in reverse engineering, where we work with a
constrained reconstruction of 3D geometric models of objects from range data.
Thus it is important to get good fits to the data.

2 Fitting of Algebraic Curves and Surfaces 

An implicit curve or surface is the set of zeros of a smooth function
$f : \mathbb{R}^n \rightarrow \mathbb{R}^k$ of the n variables: $Z(f) = \{x : f(x) = 0\}$.
We are interested in three special cases for their applications in computer vision and
especially range image analysis: $Z(f)$ is a planar curve if n = 2 and k = 1, it is a surface
if n = 3 and k = 1 and it is a space curve if n = 3 and k = 2.

Given a finite set of data points $\mathcal{D} = \{x_i\}$, $i = 1, \ldots, m$, the problem of
fitting an algebraic curve or surface $Z(f)$ to the data set $\mathcal{D}$ is usually cast as
minimizing the mean square distance

$$\frac{1}{m} \sum_{i=1}^{m} \mathrm{dist}^2(x_i, Z(f)) \rightarrow \text{Minimum} \tag{2}$$

from the data points to the curve or surface $Z(f)$, a function of the set of parameters of
the polynomial. The problem that we have to deal with is how to answer whether
the distance from a certain point $x_i$ to a set $Z(f)$ of zeros of
$f : \mathbb{R}^n \rightarrow \mathbb{R}^k$ is the (global) minimum or not. The distance from
the point $x_i$ to the zero set $Z(f)$ is defined as the minimum of the distances from
$x_i$ to points $x_t$ in the zero set $Z(f)$

$$\mathrm{dist}(x_i, Z(f)) = \min\{\|x_i - x_t\| : f(x_t) = 0\}. \tag{3}$$

Thus, the Euclidean distance $\mathrm{dist}(x_i, Z(f))$ between a point $x_i$ and the zero set
$Z(f)$ is the minimal distance between $x_i$ and the point $x_t$ in the zero set whose
tangent is orthogonal to the line joining $x_i$ and $x_t$ (see Fig. 1). As mentioned
above there is no closed form expression for the Euclidean distance from a point
to a general algebraic curve or surface and iterative methods are required to
compute it. In the past researchers have often replaced the Euclidean distance
by an approximation. But it is well known that a different performance function
can produce a very biased result. In the following we will summarize the methods
used to approximate the real Euclidean distance by the algebraic distance and
an approximation suggested by Taubin.

Fig. 1. Euclidean distance $\mathrm{dist}(x_i, Z(f))$ of a point $x_i$ to a zero set $Z(f)$

Algebraic fitting. The algebraic fitting is based on the approximation of the
Euclidean distance between a point and the curve or surface by the algebraic
distance

$$\mathrm{dist}_A(x_i, Z(f)) = f_2(x_i). \tag{4}$$

To avoid the trivial solution, where all parameters are zero, and any multiple
of a solution, the parameter vector may be constrained in some way. The pros
and cons of using algebraic distances are a) the gain in computational
efficiency, because closed form solutions can usually be obtained, on the one
hand and b) the often unsatisfactory results on the other hand.

Taubin’s fitting. An alternative way to approximately solve the minimization
problem is to replace the Euclidean distance from a point to an implicit
curve or surface by the first order approximation. There, the Taylor series
is expanded up to first order in a defined neighborhood, truncated after
the linear term, and then the triangle and the Cauchy-Schwarz inequalities
are applied.

$$\mathrm{dist}_T(x_i, Z(f)) = \frac{|f_2(x_i)|}{\|\nabla f_2(x_i)\|} \tag{5}$$

Besides the fact that no iterative procedures are required, the fundamental
property is that it is a first order approximation to the exact distance. But
it is important to note that the approximate distance is also biased in some
sense. If, for instance, a data point $x_i$ is close to a critical point of the
polynomial, i.e., $\|\nabla f_2(x_i)\| \approx 0$, but $f_2(x_i) \neq 0$, the distance
becomes large. This is certainly a limitation.

Note that neither the Algebraic distance nor Taubin’s approximation is invariant
with respect to Euclidean transformations.
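
To make the difference between the three distance measures concrete, the sketch below evaluates them for a circle, where the exact Euclidean distance is known in closed form. The circle, the function name `circle_distances`, and the use of the absolute value of the polynomial for the algebraic distance are choices made for this illustration; the Taubin value uses the standard first order form |f| / ||grad f||.

```python
import math


def circle_distances(x, y, r):
    """Algebraic, Taubin and Euclidean distance from the point (x, y) to
    the circle f(x, y) = x^2 + y^2 - r^2 = 0 centred at the origin."""
    f = x * x + y * y - r * r
    grad_norm = 2.0 * math.hypot(x, y)          # ||grad f|| = 2 ||(x, y)||
    algebraic = abs(f)
    taubin = abs(f) / grad_norm if grad_norm > 0 else math.inf
    euclidean = abs(math.hypot(x, y) - r)
    return algebraic, taubin, euclidean


# A point outside the unit circle, and a point close to the centre, which is
# a critical point of f where the first order approximation blows up.
outside = circle_distances(2.0, 0.0, 1.0)
near_centre = circle_distances(0.01, 0.0, 1.0)
```

For the point (2, 0) and the unit circle the three values are 3, 0.75 and 1. Close to the centre the Taubin value becomes very large although the true distance stays below 1, which is the bias mentioned above.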



2.1 Euclidean Distance 

To overcome the problems with the approximated distances, it is natural to
replace them again by the real geometric distances, that means the Euclidean
distances, which are invariant to transformations in Euclidean space and are
not biased. For primitive curves and surfaces like straight lines, ellipses, planes,
cylinders, cones, and ellipsoids, a closed form expression exists for the Euclidean
distance from a point to the zero set and we use these. However, as the
expression of the Euclidean distance to other 2nd order curves and surfaces is more
complicated and there exists no known closed form expression, an iterative
optimization procedure must be carried out. For more general curves and surfaces
the following simple iterative algorithm will be used (see also Fig. 2):
1. Select the initial point. In the first step we determine the initial solution
by intersecting the curve or surface with the straight line defined by the
center point $x_m$ and the point $x_i$. By the initial solution, the upper bound
for the distance is estimated.

2. Update the actual estimation, k = 0, 1, 2, .... In the second
step a new solution is determined. The search direction will be determined
by the gradient of the curve, $\nabla f(x_t^{[k]})$.

+ ( 6 ) 

The method is an adaptation of the steepest descent method. As the result
we get two possible solutions (cf. Fig. 2), and we have to decide by an
objective function T whether the new one will be accepted as the new solution.




Fig. 2. Steps to estimate the Euclidean distance $\mathrm{dist}_E(x_i, Z(f))$ of a point $x_i$ to the
zero set $Z(f)$ of an ellipse



3. Evaluate the new estimation. The set of solutions is evaluated by the
objective function

$$T(x_i, x_t^{[k+1]}, Z(f)) = \min\left(\mathrm{dist}_E(x_i, x_t^{[k]}),\ \mathrm{dist}_E(x_i, x_t^{[k+1]})\right).$$

If the distance from the new estimation $x_t^{[k+1]}$ is smaller, we accept this as
the new local solution. Otherwise $x_t^{[k+1]} = x_t^{[k]}$ and r > 0. Then, the algorithm
will be continued with step 2 until the difference between the distances of
the old and the new estimation is smaller than a given threshold. To speed
up the estimation a criterion to terminate the updating may be used, e.g.
$\|x_t^{[k+1]} - x_t^{[k]}\| < \tau_d$, or $k > \tau_k$.
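
The sketch below is not the gradient-based update described above, whose equation (6) is not reproduced here. It is a simple numerical reference for an axis-aligned ellipse centred at the origin, useful for checking any iterative scheme: `center_line_bound` mirrors step 1 (intersection with the line through the centre, giving an upper bound), and `ellipse_distance` minimizes equation (3) directly over a parametrization of the ellipse by coarse sampling followed by a ternary search. Both function names and the parametric approach are assumptions made for illustration; the query point must not coincide with the centre.

```python
import math


def center_line_bound(px, py, a, b):
    """Distance from (px, py) to the point where the ray from the centre
    through (px, py) meets the ellipse x^2/a^2 + y^2/b^2 = 1.
    This is an upper bound for the Euclidean distance."""
    s = 1.0 / math.sqrt((px / a) ** 2 + (py / b) ** 2)
    return math.hypot(px - s * px, py - s * py)


def ellipse_distance(px, py, a, b, samples=3600):
    """Euclidean distance from (px, py) to the ellipse, found by sampling
    the parameter t and refining the best sample with a ternary search."""
    def squared_distance(t):
        return (px - a * math.cos(t)) ** 2 + (py - b * math.sin(t)) ** 2

    step = 2.0 * math.pi / samples
    best = min(range(samples), key=lambda k: squared_distance(k * step))
    lo, hi = (best - 1) * step, (best + 1) * step
    for _ in range(100):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if squared_distance(m1) < squared_distance(m2):
            hi = m2
        else:
            lo = m1
    return math.sqrt(squared_distance(0.5 * (lo + hi)))


# Ellipse with semi-axes 2 and 1.
on_major_axis = ellipse_distance(3.0, 0.0, 2.0, 1.0)
on_minor_axis = ellipse_distance(0.0, 3.0, 2.0, 1.0)
general = ellipse_distance(2.0, 2.0, 2.0, 1.0)
bound = center_line_bound(2.0, 2.0, 2.0, 1.0)
```

For points on the axes the result can be checked by hand (distances 1 and 2 in this example); for a general point the centre-line intersection only provides an upper bound.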

2.2 Estimation Error of Surface Fit 

Given the Euclidean distance error for each point, we then compute the curve or
surface fitting error as $\mathrm{dist}_E(x_i, Z(f))$. The standard least-squares method tries
to minimize $\mathrm{dist}_E^2(x_i, Z(f))$, which is unstable if there are outliers in the data.
Outlying data can have so strong an effect on the minimization that the parameters
are distorted. Replacing the squared residuals by another function can reduce
the effect of outliers. Appropriate minimization criteria, including such functions,
have been discussed in the literature (for instance, [3]). It seems difficult to select a
function which is generally suitable. Following results reported in the literature,
the best choice may be the so-called $L_p$ (least power) function:
$L_p := |\mathrm{dist}_E(x_i, Z(f))|^{\nu} / \nu$. This function represents a family of functions
including the two commonly used functions $L_1$ (absolute power) with ν = 1 and
$L_2$ (least squares) with ν = 2. Note that the smaller ν, the smaller is the influence
of large errors. For values ν ≈ 1.2, a good error estimation may be expected.
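
The effect of the exponent ν on outliers can be illustrated with a few lines of code. The function names `least_power` and `mean_least_power_error`, and the synthetic residuals, are assumptions made for this example.

```python
def least_power(distance, nu=1.2):
    """L_p value |d|^nu / nu for a single Euclidean distance d."""
    return abs(distance) ** nu / nu


def mean_least_power_error(distances, nu=1.2):
    """Mean of the L_p values over all data points."""
    return sum(least_power(d, nu) for d in distances) / len(distances)


def outlier_share(distances, nu):
    """Fraction of the total L_p error contributed by the largest residual."""
    values = [least_power(d, nu) for d in distances]
    return max(values) / sum(values)


# Nine small residuals and one outlier.
residuals = [0.1] * 9 + [10.0]
share_least_squares = outlier_share(residuals, 2.0)
share_least_power = outlier_share(residuals, 1.2)
```

With ν = 2 the single outlier accounts for almost the entire error, while with ν = 1.2 its share is smaller, which is the behaviour described in the text.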



2.3 Optimization 

Given a method of computing the fitting error for the curves and surfaces, we
now show how to minimize the error. Many techniques are readily available,
including the Gauss-Newton algorithm, Steepest Gradient Descent, and the
Levenberg-Marquardt algorithm. Our implementation is based on the Levenberg-Marquardt
(LM) algorithm, which has become the standard of nonlinear optimization
routines. The LM method combines the inherent stability of the Steepest Gradient
Descent with the quadratic convergence rate of the Gauss-Newton method.
The (iterative) fitting approach consists of three major steps:

1. Select the initial fitting $\mathcal{P}^{[0]}$. The initial solution is determined by
Taubin’s fitting method.

2. Update the estimation $\mathcal{P}^{[k+1]}$ using the Levenberg-Marquardt
(LM) algorithm.

3. Evaluate the new estimation $\mathcal{P}^{[k+1]}$. The updated parameter vector is
evaluated using the $L_p$ function on the basis of $\mathrm{dist}_E(x_i, Z(f))$.
$\mathcal{P}^{[k+1]}$ will be accepted if $L_p(\mathcal{P}^{[k+1]}) < L_p(\mathcal{P}^{[k]})$ and the
fitting will be continued with step 2. Otherwise the fitting is terminated and
$\mathcal{P}^{[k]}$ is the desired solution.
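
The three steps above can be sketched for the simplest possible case, a circle, where the Euclidean distance has a closed form. The sketch departs from the paper in several labeled ways: it uses the squared error (ν = 2) instead of the least power function, it starts from the centroid of the data rather than from Taubin's fit, and instead of terminating on the first rejected update it increases the damping and retries, as is usual for Levenberg-Marquardt. All function names are illustrative, and no data point may coincide with the current centre estimate.

```python
import math


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def residuals(params, points):
    """Signed Euclidean distances of the points to the circle (cx, cy, r)."""
    cx, cy, r = params
    return [math.hypot(x - cx, y - cy) - r for x, y in points]


def error(params, points):
    return 0.5 * sum(d * d for d in residuals(params, points))


def fit_circle(points, params, iterations=100, lam=1e-3):
    """Damped Gauss-Newton (Levenberg-Marquardt style) circle fit."""
    for _ in range(iterations):
        cx, cy, _ = params
        res = residuals(params, points)
        jac = []  # rows: derivatives of one residual w.r.t. (cx, cy, r)
        for x, y in points:
            dist = math.hypot(x - cx, y - cy)
            jac.append([-(x - cx) / dist, -(y - cy) / dist, -1.0])
        jtj = [[sum(j[a] * j[b] for j in jac) for b in range(3)] for a in range(3)]
        jtr = [sum(j[a] * d for j, d in zip(jac, res)) for a in range(3)]
        for a in range(3):
            jtj[a][a] *= 1.0 + lam               # Marquardt damping
        step = solve(jtj, [-g for g in jtr])
        candidate = [p + s for p, s in zip(params, step)]
        if error(candidate, points) < error(params, points):
            params, lam = candidate, lam / 10.0  # accept the update
        else:
            lam *= 10.0                          # reject and damp more
    return params


# Noise-free half circle (partially occluded data): centre (1, 2), radius 3.
points = [(1 + 3 * math.cos(k * math.pi / 12), 2 + 3 * math.sin(k * math.pi / 12))
          for k in range(13)]
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
mean_r = sum(math.hypot(x - mean_x, y - mean_y) for x, y in points) / len(points)
fitted = fit_circle(points, [mean_x, mean_y, mean_r])
```

The accept-or-reject test in the loop corresponds to step 3: an update is kept only if it lowers the error computed from the Euclidean distances.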

3 Experimental Results 

We present experimental results comparing Euclidean fitting (EF) with Algebraic
fitting (AF) and Taubin’s fitting (TF) in terms of quality, robustness and
speed.



3.1 Robustness 

To test the robustness of the proposed EF method, we used three different
surface types: cylinders, cones, and general quadrics. Note that plane estimation
is the same for all three methods. To enforce the fitting of a special surface type,
we include in all three fitting methods the same constraints, which describe the
expected surface type. The 3D data were generated by adding isotropic Gaussian
noise σ = {1%, 5%, 10%, 20%}. Additionally, the surfaces were partially occluded.
The visible surfaces were varied between 1/2 (maximal case), 5/12, 1/3, 1/4, and
1/6 of the full 3D cylinder (see Fig. 4). In all our experiments the number of 3D
points was 5000. Finally, each experiment was run 100 times to measure the
average fitting error. The mean least power errors (MLPEs) of the different
fittings are given in Tab. 1. We determined the real geometric distance between the 3D
data points and the estimated surfaces using the method described in Sec. 2.1.
That is, we calculated the MLPE for all fitting results on the basis of the
estimated Euclidean distance; otherwise, a comparison of the results would be
meaningless. Based on this table we evaluate the three fitting methods with respect
to quality and robustness. EF requires an initial estimate for the parameters,
and we have found that the results depend on the initial choice. A quick review
of the values in Tab. 1 shows that the results of TF are better suited for initialization than
the results of AF. Another fitting method might give a better initialization,
but here we use TF because of its advantages.

Fig. 3. View of the 3D data points for a cylinder (maximal case) with added isotropic
Gaussian noise a) σ = 1%, b) σ = 5%, c) σ = 10%, and d) σ = 20%.

Fig. 4. View of the 3D data points of partially occluded cylinders, a) 1/2 (maximal
case), b) 5/12, 1/3 (see Fig. 3b), c) 1/4, and d) 1/6. Added isotropic Gaussian noise
σ = 5%.

As expected, TF and EF yield the best results with respect to the mean
and standard deviation, and the mean for EF is always lower than for the other
two algorithms. The results of AF are not acceptable because of their high values
for mean and standard deviation. The results of TF are much better than those
of AF, but in direct comparison with EF they are also
unacceptable. Furthermore, note that AF sometimes gives wrong results, which
means that the fitted curve or surface type does not match our expectations.
We removed all failed fittings from consideration. The percentage
of failures is given as a footnote in Tab. 1. For TF and EF we had no failures in
our experiments.



3.2 Noise Sensitivity 

The second experiment is perhaps more important and assesses the stability of
the fitting with respect to different realizations of noise with the same variance.
The noise has been set to a relatively high level because the limits of the three
methods are then more visible. It is very desirable that the performance is affected
only by the noise level, and not by a particular realization of the noise.
In Tab. 1 the μ’s and σ’s are shown for four different noise levels. If we analyze
the table regarding noise sensitivity, we observe:

- The stability of all fittings, reflected in the standard deviation, is influenced
by the noise level of the data. The degree of occlusion has an additional
influence on stability. Particularly serious is the combination of both high
noise level (σ > 20%) and strong occlusion (visible surface < 1/4).

- AF is very unstable, even with a noise level of σ = 1%. In some experiments
with AF the fitting failed, and the estimated mean least power error
between the estimated surface and the real 3D data was greater than a given
threshold. We removed all failed fittings, sometimes up to 23 percent (see
Tab. 1, fitting cylinder 1, 1/4 visible and σ = 10%). Thus, the performance
of the Algebraic fitting is strongly affected by the particular realization of
the noise, which is absolutely undesirable.

- TF is also affected by particular instances of the noise, but on a significantly
lower level.

- EF shows similarly good noise sensitivity. The cause of
the instability of EF is the initialization.



3.3 Sample Density 

In the third experiment we examined the influence of the sample density. The
cardinality of the 3D point set was varied accordingly. On the basis of the MLPE
for the several fittings (see Tab. 2) it can be seen that, with an increasing number
of points, the fitting becomes a) more robust and b) less noise sensitive. Note,
Table 1. Least power error fitting cylinder 1 and 2. The visible surfaces were varied
between 1/2 (maximal case), 5/12, 1/3, 1/4, and 1/6 of the full 3D cylinder. Gaussian
noise σ was 1%, 5%, 10%, and 20%. For AF the percentage of failed fittings is given
in brackets. The number of trials was 100.

Visible   AF                      TF                 EF
          [μ ± σ] · 10^(-?)       [μ ± σ] · 10^(-?)  [μ ± σ] · 10^(-?)

6/12      [ 9.06 ± 1.55] (0.08)   [ 1.14 ± 0.06]     [ 0.71 ± 0.03]
5/12      [33.19 ± 3.90] (0.22)   [ 1.30 ± 0.13]     [ 0.65 ± 0.04]
4/12      [44.91 ± 3.72] (0.02)   [ 2.75 ± 0.43]     [ 0.72 ± 0.05]
3/12      [55.06 ± 2.04] (0.08)   [ 3.82 ± 0.80]     [ 0.94 ± 0.11]
2/12      [58.05 ± 3.07] (0.13)   [ 3.94 ± 0.49]     [ 1.30 ± 0.10]

6/12      [14.80 ± 1.88] (0.03)   [ 1.32 ± 0.17]     [ 0.62 ± 0.03]
5/12      [36.11 ± 2.21] (0.14)   [ 2.92 ± 0.13]     [ 0.93 ± 0.12]
4/12      [25.56 ± 2.76] (0.08)   [ 5.27 ± 0.32]     [ 1.35 ± 0.32]
3/12      [55.13 ± 3.50] (0.02)   [ 5.07 ± 0.75]     [ 1.97 ± 0.45]
2/12      [55.93 ± 3.08] (0.15)   [ 4.13 ± 1.03]     [ 2.34 ± 0.81]

6/12      [10.44 ± 1.09] (0.10)   [ 2.05 ± 0.29]     [ 0.97 ± 0.12]
5/12      [23.10 ± 3.47] (0.18)   [ 3.48 ± 0.74]     [ 1.71 ± 0.63]
4/12      [37.76 ± 3.45] (0.23)   [ 3.90 ± 0.79]     [ 1.78 ± 0.49]
3/12      [56.37 ± 2.73] (0.09)   [ 4.28 ± 1.04]     [ 1.83 ± 0.34]
2/12      [58.46 ± 3.33] (0.03)   [ 8.98 ± 3.46]     [ 3.02 ± 1.13]

6/12      [15.34 ± 1.93] (0.03)   [ 2.38 ± 0.35]     [ 1.09 ± 0.10]
5/12      [49.40 ± 3.18] (0.13)   [ 2.90 ± 0.56]     [ 1.07 ± 0.07]
4/12      [55.61 ± 3.24] (0.11)   [ 3.69 ± 0.64]     [ 1.41 ± 0.11]
3/12      [22.49 ± 2.66] (0.10)   [ 4.05 ± 0.95]     [ 1.72 ± 0.38]
2/12      [41.37 ± 3.22] (0.12)   [ 9.20 ± 3.55]     [ 3.00 ± 1.13]

6/12      [26.68 ± 1.37] (0.02)   [ 6.40 ± 0.05]     [ 4.26 ± 0.02]
5/12      [21.82 ± 0.96]          [ 6.83 ± 0.13]     [ 3.68 ± 0.21]
4/12      [25.39 ± 0.88]          [ 7.28 ± 0.25]     [ 4.12 ± 0.17]
3/12      [21.93 ± 1.76] (0.01)   [10.67 ± 0.74]     [ 3.18 ± 0.48]
2/12      [25.80 ± 2.26]          [23.24 ± 1.67]     [ 5.43 ± 0.70]

6/12      [25.31 ± 1.26] (0.03)   [ 6.67 ± 0.21]     [ 4.73 ± 0.33]
5/12      [18.54 ± 0.83]          [ 8.06 ± 0.71]     [ 3.22 ± 0.18]
4/12      [25.32 ± 1.05]          [ 8.38 ± 0.72]     [ 5.26 ± 0.70]
3/12      [18.29 ± 1.10] (0.07)   [15.91 ± 1.60]     [ 6.47 ± 1.08]
2/12      [40.08 ± 1.90] (0.02)   [25.38 ± 1.57]     [ 8.88 ± 1.36]

6/12      [26.27 ± 1.45] (0.09)   [ 6.71 ± 0.23]     [ 3.99 ± 0.30]
5/12      [19.31 ± 0.84]          [ 7.46 ± 0.48]     [ 3.79 ± 0.45]
4/12      [27.33 ± 0.84]          [ 8.19 ± 0.90]     [ 4.11 ± 0.52]
3/12      [23.42 ± 1.92]          [15.87 ± 1.70]     [ 5.32 ± 0.85]
2/12      [31.89 ± 1.85] (0.02)   [25.68 ± 2.04]     [ 7.15 ± 0.92]

6/12      [24.74 ± 1.33] (0.02)   [ 6.80 ± 0.27]     [ 3.49 ± 0.13]
5/12      [18.55 ± 0.87] (0.07)   [ 7.11 ± 0.56]     [ 3.95 ± 0.33]
4/12      [27.08 ± 0.90]          [ 7.43 ± 0.35]     [ 5.24 ± 0.35]
3/12      [22.32 ± 1.36] (0.04)   [15.17 ± 1.79]     [ 6.60 ± 0.91]
2/12      [35.18 ± 2.28] (0.02)   [38.71 ± 8.81]     [11.30 ± 2.22]
Table 2. Mean squares error for cylinder fitting by varied sample density. The density
was 500, 1000, and 2000 3D points. Gaussian noise was σ = 5%. For AF the percentage
of failed fittings is given as a footnote.

Visible   AF                      TF                 EF
          [μ ± σ] · 10^(-?)       [μ ± σ] · 10^(-?)  [μ ± σ] · 10^(-?)

6/12      [15.88 ± 2.54] (0.03)   [ 2.94 ± 0.80]     [ 1.17 ± 0.30]
5/12      [29.73 ± 3.31] (0.03)   [ 1.55 ± 0.11]     [ 0.86 ± 0.06]
4/12      [32.96 ± 2.55] (0.05)   [ 4.39 ± 0.90]     [ 2.30 ± 0.68]
3/12      [23.67 ± 3.01] (0.04)   [ 3.81 ± 0.51]     [ 1.55 ± 0.12]
2/12      [24.36 ± 1.51] (0.06)   [ 6.86 ± 2.45]     [ 4.37 ± 1.63]

6/12      [16.61 ± 2.42] (0.08)   [ 1.57 ± 0.17]     [ 0.85 ± 0.11]
5/12      [36.17 ± 3.52] (0.17)   [ 3.52 ± 1.40]     [ 1.75 ± 0.59]
4/12      [35.06 ± 2.70] (0.06)   [ 2.79 ± 0.33]     [ 1.24 ± 0.15]
3/12      [22.01 ± 3.03] (0.02)   [ 4.51 ± 0.71]     [ 1.68 ± 0.16]
2/12      [25.43 ± 1.91]          [ 4.10 ± 1.17]     [ 1.47 ± 0.12]

6/12      [16.85 ± 3.00] (0.08)   [ 1.61 ± 0.26]     [ 0.72 ± 0.03]
5/12      [45.99 ± 2.94] (0.07)   [ 1.93 ± 0.43]     [ 0.73 ± 0.04]
4/12      [38.38 ± 2.81] (0.15)   [ 3.30 ± 0.76]     [ 1.22 ± 0.29]
3/12      [19.79 ± 2.52] (0.06)   [ 4.60 ± 1.16]     [ 2.05 ± 0.56]
2/12      [24.28 ± 1.64] (0.07)   [ 2.22 ± 0.28]     [ 1.30 ± 0.08]

6/12      [23.27 ± 0.98]          [ 7.08 ± 0.24]     [ 4.10 ± 0.32]
5/12      [20.05 ± 1.94]          [ 7.82 ± 0.56]     [ 3.72 ± 0.22]
4/12      [25.19 ± 1.30]          [10.55 ± 0.62]     [ 4.66 ± 0.28]
3/12      [21.12 ± 1.85] (0.04)   [17.10 ± 1.68]     [ 7.18 ± 0.87]
2/12      [38.13 ± 2.77] (0.17)   [24.74 ± 2.86]     [ 7.90 ± 0.81]

6/12      [25.84 ± 1.28] (0.05)   [ 7.11 ± 0.31]     [ 3.97 ± 0.39]
5/12      [20.19 ± 1.59]          [ 7.60 ± 0.37]     [ 3.33 ± 0.19]
4/12      [25.11 ± 0.89]          [ 8.37 ± 0.40]     [ 4.23 ± 0.17]
3/12      [20.61 ± 1.33] (0.06)   [14.82 ± 1.21]     [ 8.27 ± 1.03]
2/12      [38.37 ± 2.73] (0.10)   [21.61 ± 1.76]     [ 6.66 ± 0.84]

6/12      [24.30 ± 1.11] (0.02)   [ 6.39 ± 0.17]     [ 3.74 ± 0.29]
5/12      [18.95 ± 0.82] (0.02)   [ 7.03 ± 0.32]     [ 3.08 ± 0.17]
4/12      [27.31 ± 0.87]          [ 7.98 ± 0.34]     [ 4.36 ± 0.16]
3/12      [20.31 ± 1.02] (0.06)   [14.92 ± 1.50]     [ 5.36 ± 0.79]
2/12      [35.16 ± 2.15] (0.05)   [24.90 ± 1.76]     [ 8.82 ± 1.21]


not only is the absolute number of points important, but the point density is 
crucial. 

However, noise sensitivity increases with increasing occlusion for both TF and
EF, so that the fitting becomes altogether more unstable. Conclusions about AF
similar to those in Sec. 3.1 and Sec. 3.2 also apply here.
3.4 Computational Cost

The algorithms have been implemented in C and the computation was performed
on a SUN Sparc ULTRA 5 workstation. The average computational costs in
milliseconds per 1000 points for the three algorithms are given in Tab. 3.

As expected, AF and TF supply the best performance, because the
EF algorithm requires a repeated search for the point $x_t$ closest to $x_i$ and the
calculation of the Euclidean distance. A quick review of the values in Tab. 3 shows
that the computational costs increase if we fit an elliptical cylinder, a circular or
an elliptical cone, or a general quadric. The algorithm to estimate the
distance, by the closed-form solution or the iterative algorithm respectively, is more
complicated in these cases (cf. Sec. 2.1).

The number of necessary iterations is also influenced by the required precision
of the LM algorithm to terminate the updating process.



Table 3. Average computational costs in milliseconds per 1000 points.

Average computational costs [msec.]
                      AF        TF        EF
Plane                 0.958     1.042     2.417
Sphere                1.208     1.250     3.208
Circular cylinder     3.583     3.625    12.375
Elliptical cylinder  13.292    13.958   241.667
Circular cone        15.667    15.833   288.375
Elliptical cone      15.042    15.375   291.958
General quadric      18.208    18.458   351.083


4 Conclusion 

We revisited the Euclidean fitting of curves and surfaces to 3D data to investigate
whether it is worth considering Euclidean fitting again. The focus was on the quality
and robustness of Euclidean fitting compared with the commonly used Algebraic
fitting and Taubin’s fitting. We can now conclude that robustness and accuracy
increase sufficiently compared to both other methods, and that Euclidean fitting is
more stable with increased noise.

The main disadvantage of Euclidean fitting, computational cost, has become
less important due to rising computing speed. In our experiments the
computational costs of Euclidean fitting were only about 2 to 19 times higher than those of
Taubin’s fitting. This relation probably cannot be improved substantially in favor
of Euclidean fitting, but the absolute computational costs are becoming an
insignificant deterrent to usage, especially if high accuracy is required.
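The quoted cost factor can be checked directly against Table 3. The short Python sketch below only recomputes the ratio of the EF cost to the TF cost per surface type; the dictionary name `costs_ms` is hypothetical and the numbers are copied from the table.

```python
# Average costs in msec per 1000 points, copied from Table 3 as (AF, TF, EF).
costs_ms = {
    "Plane": (0.958, 1.042, 2.417),
    "Sphere": (1.208, 1.250, 3.208),
    "Circular cylinder": (3.583, 3.625, 12.375),
    "Elliptical cylinder": (13.292, 13.958, 241.667),
    "Circular cone": (15.667, 15.833, 288.375),
    "Elliptical cone": (15.042, 15.375, 291.958),
    "General quadric": (18.208, 18.458, 351.083),
}

# Ratio of Euclidean fitting cost to Taubin's fitting cost for each surface.
ratios = {name: ef / tf for name, (af, tf, ef) in costs_ms.items()}
for name, ratio in ratios.items():
    print(f"{name:20s} EF/TF = {ratio:5.2f}")
print("range:", round(min(ratios.values()), 2), "to", round(max(ratios.values()), 2))
```

The smallest ratio occurs for the plane and the largest for the general quadric, which brackets the factor stated above.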

Acknowledgements. The work was funded by the CAMERA (CAd Modelling 
of Built Environments from Range Analysis) project, an EC TMR network (ERB 
FMRX-CT97-0127). 
Abstract. Loss of information in images undergoing fine-to-coarse image
transformations is analyzed by using an approach based on the theory
of irreversible transformations. It is shown that entropy variation along 
scales can be used to characterize basic, low-level information and to 
gauge essential perceptual components of the image, such as shape and 
texture. The use of isotropic and anisotropic fine-to-coarse transforma- 
tions of grey level images is discussed, and an extension of the approach 
to multi-valued images is proposed, where cross-interactions between the 
different colour channels are allowed. 



1 Introduction 

The sensing process in vision can be seen as a mapping T from the state of the
world into a set of images. Suppose that some detail of the state of the world
we are interested in has a representation g; the map T applied to g produces
an image I: T(g) = I. The representation g can be, for instance, the shape of
some surface, a description of the meaningful part of the image, or a function
indicating if an object is present in the image. This raises the issue of how visual
information, the information contained in an image, can be encoded in a way
that is suited for the specific problem the visual system must solve. The most
basic image representation, and indeed the one closest to the process of image
formation, can be derived as follows: suppose the image domain D to be a lattice
of N pixels and let the intensity at each pixel s be given by the number n of
photons reaching s; thus the image is determined by their distribution. To each
pixel, then, it is possible to associate a number of photons whose distribution
can be seen as the realization of a random field T. The information, in the sense
of Shannon, can then be computed by considering the distribution of photons on the
retina, and different measures of information can be obtained, depending on the
constraints imposed on the distributions Q. However, the information obtained
this way is defined on the set of all possible images, given the constraints; hence,
all images are assigned the same information and, in particular, it is not possible
to discriminate among regions with different information content.

Here we shall present a different approach, based on the idea that entropy,
defined in the sense of statistical mechanics, measures lack of information [2].
Images are considered as an isolated thermodynamical system by identifying the
image intensity with some thermodynamical variable, e.g. temperature or concentration
of particles, evolving in time. It will be shown that entropy production
in fine-to-coarse transformations of image representations implicitly provides a
measure of the information contained in the original image; since a local measure
of entropy can be defined by means of the theory of irreversible transformations
[?], parts of the image with different information content will be identified with
those giving rise to different amounts of local entropy production. Some examples
will be given, illustrating how the identification of such different regions
can provide a preliminary step towards the inference of shape and texture. The
generalization of the approach to vector-valued images is also discussed.

2 Information from Irreversible Transformations 

Let Ω be a subset of $\mathbb{R}^2$, (x, y) denote a point in Ω, and the scalar field
$I : (x, y) \times t \to I(x, y, t)$ represent a gray-level image. The non-negative parameter
t defines the scale of resolution at which the image is observed; small values of t
correspond to fine scales, while large values correspond to coarse scales. A scale
transformation is given by an operator T that takes the original image $I(\cdot, 0)$ to
an image at a scale t, namely, $T_t : I(\cdot, 0) \to I(\cdot, t)$. We assume first T to be a
semi-dynamical system, that is, a non-invertible or irreversible transformation
which cannot be run backward across scale. Irreversibility of T ensures that
the causality principle is satisfied, i.e. any feature at coarse level “controls” the
possible features at a finer level of resolution, but the reverse need not be
true. As a consequence of these assumptions every image point will, under the
action of T, converge to a set of equilibrium points, denoted by the set $I^*$, while
preserving the total intensity [?].

Let $I^*$ denote the fixed point of the transformation, that is, $I^*(\cdot, 0) = I^*(\cdot, t)$
for all t, and let $I^*$ be stable, that is, $\lim_{t \to \infty} I(\cdot, t) = I^*(\cdot)$. Relevant information
contained in an image I can be measured in terms of the Kullback-Leibler distance
between I and the corresponding fixed point $I^*$ under the transformation
T. In [?], the conditional entropy $H(f|f^*)$ was introduced as the negative of
the Kullback-Leibler distance:

$$H(f|f^*) = -\iint_\Omega f(x,y,t) \ln \frac{f(x,y,t)}{f^*(x,y)} \, dx\, dy.$$

Here f and $f^*$ are the normalized versions of I and $I^*$, respectively. The “dynamic
information” of the image, under the process T, is provided by the evolution of
$H(f|f^*)$ across scales:

$$\frac{\partial}{\partial t} H(f|f^*) = \frac{\partial}{\partial t} H(f) + \frac{\partial}{\partial t} \iint_\Omega f(x,y,t) \ln f^*(x,y)\, dx\, dy, \qquad (1)$$

where $H(f) = -\iint_\Omega f(x,y,t) \ln f(x,y,t)\, dx\, dy$ is the Boltzmann-Gibbs
entropy measure. It has been proven [?] that

$$\frac{\partial}{\partial t} H(f) = -\iint_\Omega \ln f(x,y,t)\, \frac{\partial}{\partial t} f(x,y,t)\, dx\, dy.$$

If $f^*$ is constant over Ω, then $\frac{\partial}{\partial t} H(f|f^*) = \frac{\partial}{\partial t} H(f)$, and the entropy production as defined by the
thermodynamics of irreversible transformations [3], that is $\mathcal{P} = \frac{\partial}{\partial t} H(f)$, can be used
to quantify the information loss during a transformation from fine to coarse
image representations. For simplicity we have dropped the scale variable t
in H and $\mathcal{P}$. Fine-to-coarse transformations can be modelled by a diffusion or
heat equation [?]:

$$\frac{\partial f(x,y,t)}{\partial t} = \nabla^2 f(x,y,t). \qquad (2)$$

In this case, it has been shown [?] that

$$\mathcal{P} = \iint_\Omega f(x,y,t)\, \sigma(x,y,t)\, dx\, dy, \qquad (3)$$

where

$$\sigma(x,y,t) = \frac{\nabla f(x,y,t) \cdot \nabla f(x,y,t)}{f(x,y,t)^2} \qquad (4)$$

is the density of entropy production in the thermodynamical sense [?]. Since
$\mathcal{P} > 0$, H(f) is an increasing function of t. Note, also, that
$\lim_{t \to \infty} \sigma = \lim_{t \to \infty} \mathcal{P} = 0$.
By using $\mathcal{P}$ it is possible to gauge, at any t, the information loss of the
process through its entropy increase. Furthermore, this can be done locally, at
each point of the pattern. Fig. 1 shows an example of local entropy production
maps computed at different scales for the image “Aerial 1”. For representation
purposes, σ(x, y, t) has been linearly quantized in the interval [0, 255] and the
whole map rendered as a grey level image, where brighter pixels represent higher
entropy production. It is easy to see that, during the transformation, σ is large
along borders defining the shape of objects (e.g., buildings), smaller in textured
regions (e.g., vegetation), and almost zero in regions of almost constant intensity.
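As an illustration of Eqs. 2 to 4, the following Python/NumPy sketch computes the density of entropy production and the global entropy production for a synthetic image and applies one explicit step of the heat equation. It is a sketch under stated assumptions: the function names, the finite-difference scheme, the reflecting borders, the step size `dt` and the test image are all hypothetical choices, not taken from the experiments.

```python
import numpy as np

def normalize(img, eps=1e-6):
    """Normalized image f: strictly positive and summing to one."""
    f = np.asarray(img, dtype=float) + eps
    return f / f.sum()

def entropy(f):
    """Discrete Boltzmann-Gibbs entropy H(f) = -sum f ln f."""
    return float(-(f * np.log(f)).sum())

def entropy_production_density(f):
    """sigma = (grad f . grad f) / f^2, using central differences (Eq. 4)."""
    fy, fx = np.gradient(f)
    return (fx**2 + fy**2) / f**2

def diffuse_step(f, dt=0.2):
    """One explicit Euler step of the heat equation with reflecting borders."""
    p = np.pad(f, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * f
    return f + dt * lap

# Hypothetical test image: a bright square on a dark background.
img = np.full((32, 32), 10.0)
img[8:24, 8:24] = 200.0
f = normalize(img)

sigma = entropy_production_density(f)   # local map, large along the square border
P = float((f * sigma).sum())            # global entropy production (Eq. 3)
f_next = diffuse_step(f)
print(P, entropy(f), entropy(f_next))
```

In this example σ vanishes in the flat corners of the image and is positive along the border of the square, and the entropy does not decrease after the diffusion step, in line with the statements above.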

In the case of anisotropic transformations, e.g. anisotropic diffusion, the derivation
of the formula for the evolution of $H(f|f^*)$ is obtained in a way similar
to the one outlined above. Consider the well-known Perona-Malik equation for
anisotropic diffusion

$$\frac{\partial}{\partial t} f(x,y,t) = \mathrm{div}\big(\chi(x,y)\, \nabla f(x,y,t)\big), \qquad (5)$$

where the neighborhood cliques which represent similarities between pixels can
be defined by a conductance function, χ, that depends on the local structure of
the image. For instance, Perona and Malik [?] assume χ to be a nonnegative,
monotonically decreasing function of the magnitude of the local image gradient. In
this way the diffusion mainly takes place in areas where intensity is constant or
varies slowly, whereas it does not affect areas with large intensity transitions.
However, it has been widely noted that the choice of the conductance function
χ is critical, since an ill-posed diffusion may occur, in the sense that images close



On the Representation of Visual Information 179 




Fig. 1. The image “Aerial 1” and three local entropy production maps computed at 
different scales. Each map, from left to right, is obtained at 1, 5, 20 iterations of the 
diffusion process. Brighter points indicate high density of entropy production. 



to each other are likely to diverge during the diffusion process [?]. Here, we assume
Nordstrom’s conjecture, namely we assume that, under a suitable choice
of χ(x, y), and for $t \to \infty$, the stationary point $f^*$ of the transformation corresponds
to a piecewise constant function with sharp boundaries between regions
of constant intensity [?]. In practice, such an ideal fixed point can be obtained by
defining $f^*(\cdot) = f(\cdot, t^*)$ with $t^* \gg 0$ (as is actually done in the simulations).
Note that, in contrast with the isotropic case, the fixed point $f^*(\cdot)$ of the
transformation is constant with respect to time t, but not with respect to the
direction of the spatial derivatives. It has been shown that, in this case, a
local measure $\sigma_{an}$ of the variation rate of the conditional entropy $H(f|f^*)$ can be
defined by setting

$$\frac{\partial}{\partial t} H(f|f^*) = \mathcal{P} - \mathcal{S} = \iint_\Omega f(x,y,t)\, \sigma_{an}(x,y,t)\, dx\, dy, \qquad (6)$$

where $\mathcal{P} = \iint_\Omega f(x,y,t)\, \sigma'_{an}(x,y,t)\, dx\, dy$ and
$\mathcal{S} = \iint_\Omega f(x,y,t)\, \sigma''_{an}(x,y,t)\, dx\, dy$, and $\sigma_{an}$ can be written in terms of two
terms, denoted by $\sigma'_{an}$ and $\sigma''_{an}$ respectively,

$$\sigma_{an}(x,y,t) = \sigma'_{an}(x,y,t) - \sigma''_{an}(x,y,t) = \chi(x,y) \frac{\nabla f(x,y,t) \cdot \nabla f(x,y,t)}{f(x,y,t)^2} - \chi(x,y) \frac{\nabla f(x,y,t) \cdot \nabla f^*(x,y)}{f(x,y,t)\, f^*(x,y)}. \qquad (7)$$

Numerical computations of $\partial H(f|f^*)/\partial t$ and $\mathcal{P}$, $\mathcal{S}$, performed on a data set
of 120 natural images, provide evidence that if χ is a non-negative decreasing
function of $|\nabla f|$, then $0 < \mathcal{S} < \mathcal{P}$, so that $\mathcal{P} - \mathcal{S} > 0$. Further, the evolution of
$H(f|f^*)$, at least for small t, is determined by entropy production alone.

The term $\sigma''_{an} = \chi(x,y) \frac{\nabla f \cdot \nabla f^*}{f f^*}$ is calculated by using an approximated
fixed point image $\tilde{f}^*(x,y) \approx f^*(x,y)$ obtained by letting the anisotropic
diffusion run for large t. The image in Fig. 2 shows the fixed point for the
“Aerial 1” image obtained for t = 2000 iterations and using a “balanced”
backward sharpening diffusion as detailed in [?]. In the same figure, plots of
$\partial H(f|f^*)/\partial t$ and $\mathcal{P}$, $\mathcal{S}$ as functions of t are also depicted.
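How an approximate fixed point can be produced by running an anisotropic diffusion for many iterations is sketched below in Python/NumPy. Note the substitutions: this is the classical explicit four-neighbour Perona-Malik scheme with an exponential conductance evaluated on the evolving image, not the “balanced” backward sharpening diffusion used for Fig. 2. The function names, `kappa`, `dt`, the iteration count and the test image are hypothetical.

```python
import numpy as np

def perona_malik_step(f, kappa, dt=0.2):
    """One explicit Perona-Malik step; conductance g(d) = exp(-(d / kappa)^2)."""
    p = np.pad(f, 1, mode="edge")
    # Differences to the four nearest neighbours (zero flux at the borders).
    diffs = (p[:-2, 1:-1] - f, p[2:, 1:-1] - f, p[1:-1, :-2] - f, p[1:-1, 2:] - f)
    update = sum(np.exp(-(d / kappa) ** 2) * d for d in diffs)
    return f + dt * update

def approximate_fixed_point(f, kappa, n_iter=200, dt=0.2):
    """Run the diffusion for many iterations to approximate f*."""
    for _ in range(n_iter):
        f = perona_malik_step(f, kappa, dt)
    return f

# Hypothetical test image: two flat regions separated by a sharp edge, plus noise.
rng = np.random.default_rng(1)
img = np.where(np.arange(32)[None, :] < 16, 0.2, 0.8) * np.ones((32, 32))
noisy = img + rng.normal(scale=0.01, size=img.shape)
fixed = approximate_fixed_point(noisy, kappa=0.1)
print(noisy[:, :16].std(), fixed[:, :16].std())
```

In this toy case the noise inside each region is smoothed away while the step between the two regions is preserved, which is the piecewise constant behaviour assumed for the fixed point.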




Fig. 2. The fixed point of “Aerial 1” obtained through 2000 iterations of anisotropic
diffusion. The graph on the right of the image plots $\partial H(f|f^*)/\partial t$, $\mathcal{P}$, and $\mathcal{S}$ as a
function of scale, represented by iterations of the anisotropic diffusion process. Units
are nats/t.



In [?] it has been argued that $\mathcal{P}\,dt = dS$ is the loss of information in the
transition from scale t to t + dt, that is, $\mathcal{P}$ is the loss of information for unit
scale. Intuitively, $\mathcal{P}$ is a global measure of the rate at which the image, under
the action of T, loses structure. Analogously, the density of entropy production σ,
being a function of x, y, measures the local loss of information, that is, the loss
of information at a given pixel for unit scale. In other terms, σ depends on
the local structure of the image and thus enables us to define features which
can exist at different scales (in different image regions). The density of entropy
production implicitly defines regions of different information content, or saliency;
furthermore it is clear, by definition, that σ is large along edges, smaller in
textured regions, and almost zero in regions of almost constant intensity. This
property has been the basis for a method of region identification [?].

Let the activity across scales, in the isotropic case, be
$a_\sigma(x,y) = \int_0^\infty \sigma(x,y,t)\, dt$; similarly, for anisotropic diffusion, the
activities $a_{\sigma_{an}}(x,y)$, $a_{\sigma'_{an}}$ and $a_{\sigma''_{an}}$ can be defined, with
$a_{\sigma_{an}} = a_{\sigma'_{an}} - a_{\sigma''_{an}}$. The activity is then
a function defined on the image plane. It is possible, by a thresholding procedure
[?], to distinguish between two adjacent regions of different activity. Namely,
every pixel in the image is classified as belonging to one of three classes: low
information regions or l-type regions (characterized by low activity), medium
information regions or m-type (medium activity), and high information regions
or h-type (high activity). Such basic features represent a preliminary step, both
in the isotropic and anisotropic case, towards the labelling of image regions in
terms of basic perceptual components: smooth regions, textures and edges. An
example, which compares different results obtained in the anisotropic case vs.
the isotropic case, is provided in Fig. 3.
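The isotropic activity and the three-class labelling can be sketched as follows in Python/NumPy. The assumptions are substantial: the integral over scale is replaced by a finite sum over diffusion iterations, and the thresholds are simple quantiles of the activity map (`low_q`, `high_q`), which is a hypothetical stand-in for the thresholding procedure cited above. All function names and the test image are likewise hypothetical.

```python
import numpy as np

def heat_step(f, dt=0.2):
    """One explicit step of isotropic diffusion with reflecting borders."""
    p = np.pad(f, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * f
    return f + dt * lap

def sigma_density(f):
    """Density of entropy production (grad f . grad f) / f^2."""
    fy, fx = np.gradient(f)
    return (fx**2 + fy**2) / f**2

def activity(f, n_iter=100, dt=0.2):
    """Accumulate sigma over the diffusion iterations (discrete activity)."""
    a = np.zeros_like(f)
    for _ in range(n_iter):
        a += dt * sigma_density(f)
        f = heat_step(f, dt)
    return a

def classify(a, low_q=0.5, high_q=0.9):
    """Label pixels 0 (l-type), 1 (m-type) or 2 (h-type) by quantile thresholds."""
    lo, hi = np.quantile(a, [low_q, high_q])
    labels = np.ones(a.shape, dtype=int)
    labels[a <= lo] = 0
    labels[a >= hi] = 2
    return labels

# Hypothetical test image: a bright square on a dark, strictly positive background.
img = np.full((32, 32), 10.0)
img[8:24, 8:24] = 200.0
a = activity(img)
labels = classify(a)
print(np.bincount(labels.ravel()))
```

For this image the pixels along the border of the square accumulate the highest activity and are labelled h-type, while flat areas far from the border fall into the l-type class.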




Fig. 3. Results of region identification by activity for image “Aerial 1”. The left column
displays results obtained experimenting with the isotropic process: the top image
presents h-type regions, whereas the bottom image presents m-type regions. The two
images in the right column are the corresponding results obtained via the anisotropic
process. In both experiments, the activity has been computed by integrating over 100
iterations.



Note that the separation of regions of different activity entails the separa- 
tion of well defined objects, the buildings, for instance, characterized by high 
activity, from a cluttered/textured background (vegetation). Further, the use of 
anisotropic diffusion makes localization of different features in the image more 
precise. In particular, textural parts are more neatly encoded by m-type regions,
whereas localization of shapes via edges is improved, since anisotropic diffusion
avoids edge blurring and displacement, and is confined to h-type regions.
3 Deriving Spatio-Chromatic Information

In order to provide an extension of this framework to color images it is necessary
to address the issue of multi-valued isotropic diffusion. A color image can be
considered a vector-valued image, which we denote $\mathbf{f}(x,y,t) = (f_i(x,y,t))^T$,
where i = 1, 2, 3; a fine-to-coarse transformation must then be defined for the
vector field $\mathbf{f}(x,y,t)$. Little work has been devoted to extending Eq. 2 to color
images, while some works do address anisotropic diffusion of vector-valued images [?].
Here the problem is tackled by representing a color image as a system of
independent single-valued diffusion processes evolving simultaneously:

$$\frac{\partial f_i(x,y,t)}{\partial t} = \nabla^2 f_i(x,y,t), \qquad (8)$$

where i = 1, 2, 3 labels the color components, or channels, in the image. Unfortunately,
Eq. 8 does not allow interactions among different color channels to
take place, whereas there is a general agreement that the processing and interpretation
of color images cannot be reduced to the separate processing of
three independent channels, but must account to some extent for cross-effects
between channels, whatever the channels employed (RGB, HSI, etc.). Indeed, exploiting
cross-effects is an important point when complex images are taken into
account, where complex interactions occur among color, shape and texture (cf.
Fig. 4).




Fig. 4. The “Waterworld” image (color, in the original) presents complex interaction
between colour, shape and texture.



It is well known [?] that diffusion (Eq. 2) can be derived from a more general
equation, namely $\frac{\partial f}{\partial t} = -\mathrm{div}\,\mathbf{J}$, where $\mathbf{J}$ is the flux density or flow [?]. Also, it
is known that, in a large class of transformations, irreversible fluxes are linear
functions of the thermodynamical forces expressed by the phenomenological laws
of irreversible processes. For instance, Fourier’s law states that the components
of the heat flow are linearly related to the gradient of the temperature. In this
linear region, then, $\mathbf{J} = L\mathbf{X}$ [?], where $\mathbf{X}$ is called the generalized force and
L is a matrix of coefficients. In this case the density of entropy production is
given by [?] $\sigma = \mathbf{J} \cdot \mathbf{X} = \sum_{i,j} L_{ij} X_i X_j$, where i, j = 1, 2, 3 label the three color
channels. Hence, we define, for each color channel, the transition from fine to
coarse scale through the equation $\frac{\partial f_i}{\partial t} = -\mathrm{div}\,\mathbf{J}_i$ and, in turn, for each i, the flow
density is given by $\mathbf{J}_i = \sum_j L_{ij} \mathbf{X}_j$. We have chosen to model interactions
among color components by setting $\mathbf{X}_i = \nabla \frac{1}{f_i(x,y,t)}$ and $L_{ij} = \kappa_{ij} f_i f_j$, where
$\kappa_{ij} = \kappa_{ji}$ are symmetric coefficients weighting the strength of the interactions
between channels i and j and whose maximum value is $\kappa_{ii} = 1$. We then obtain
the following system of coupled evolution equations:

$$\frac{\partial f_i}{\partial t} = -\mathrm{div}\Big( \sum_j \kappa_{ij}\, f_i f_j\, \nabla \frac{1}{f_j} \Big). \qquad (9)$$

Eq. 9 can be developed as:

$$\frac{\partial f_i}{\partial t} = \nabla^2 f_i + \sum_{j \neq i} \kappa_{ij} \Big[ \frac{f_i}{f_j} \nabla^2 f_j + \frac{\nabla f_i \cdot \nabla f_j}{f_j} - \frac{f_i}{f_j^2}\, \nabla f_j \cdot \nabla f_j \Big]. \qquad (10)$$

Then, color evolution across scales, in the different channels, comprises a
purely diffusive term and a nonlinear term that depends on the interactions
among channels; if $\kappa_{ij} = \delta_{ij}$, that is, if the channels are considered as isolated
systems, Eq. 10 reduces to Eq. 8. The local entropy production Σ(x, y, t) can
then be computed as

$$\Sigma(x,y,t) = \sum_i \mathbf{J}_i \cdot \mathbf{X}_i = \sum_{i,j} L_{ij}\, \mathbf{X}_i \cdot \mathbf{X}_j, \qquad (11)$$



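and in the case of grey-level images Eq. 11 reduces to Eq. 4. Again, by considering
$L_{ij} = \kappa_{ij} f_i f_j$ and using the symmetry property of the coefficients, Eq. 11 can be
written explicitly for color images:

$$\Sigma(x,y,t) = \sum_i \frac{\nabla f_i \cdot \nabla f_i}{f_i^2} + \sum_{i \neq j} \kappa_{ij} \frac{\nabla f_i \cdot \nabla f_j}{f_i f_j}. \qquad (12)$$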
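Eq. 12 can be evaluated directly on a three-channel image. The Python/NumPy sketch below assumes strictly positive channels and a symmetric coupling matrix with unit diagonal; the function name `spatio_chromatic_density`, the off-diagonal value 0.5 and the random test image are hypothetical.

```python
import numpy as np

def spatio_chromatic_density(f, kappa):
    """Spatio-chromatic density of entropy production for a 3-channel image.

    f: array of shape (3, H, W) with strictly positive values.
    kappa: symmetric (3, 3) coupling matrix with ones on the diagonal.
    Returns sum_ij kappa_ij (grad f_i . grad f_j) / (f_i f_j), shape (H, W).
    """
    gy, gx = np.gradient(f, axis=(1, 2))
    # Relative gradients grad f_i / f_i, stacked as shape (3, H, W, 2).
    r = np.stack([gx / f, gy / f], axis=-1)
    return np.einsum("ij,ihwk,jhwk->hw", kappa, r, r)

# Hypothetical data: random positive channels and a weak symmetric coupling.
rng = np.random.default_rng(0)
f = rng.uniform(0.5, 1.5, size=(3, 16, 16))
kappa = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])

Sigma = spatio_chromatic_density(f, kappa)          # coupled channels
Sigma_iso = spatio_chromatic_density(f, np.eye(3))  # isolated channels
print(Sigma.mean(), Sigma_iso.mean())
```

With the identity matrix as coupling the result is the sum of the three single-channel densities, which mirrors the reduction to isolated channels noted in the text.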



The density of entropy production Σ is, not surprisingly, made up of two
terms: $\sigma_i = \nabla f_i \cdot \nabla f_i / f_i^2$, the density of entropy production for every channel i,
considered in isolation, and the cross terms $\kappa_{ij} \nabla f_i \cdot \nabla f_j / (f_i f_j)$ that account for the
dependence of entropy production on interactions among channels. Obviously,
if color channels are considered isolated, $\Sigma = \sum_i \sigma_i$. By definition, it is clear
that Σ depends on the local spatio-chromatic properties of the image at a given
scale t. Then, Σ measures the local loss of spatio-chromatic information, that
is, the loss of information at a given pixel for unit scale. More formally, given
a finite number k = 1, 2, ..., K of iterations, it is possible to sample the spatio-chromatic
density of entropy production as $\Sigma(x, y, t_k)$. Then, at each point (x, y)
of the image, a feature vector $\mathbf{v} = [\Sigma(x,y,t_1), \Sigma(x,y,t_2), \ldots, \Sigma(x,y,t_K)]^T$ can
be defined, which captures the evolution of Σ at (x, y) across different scales. A
clustering procedure is then applied to the vectors v, a cluster being a natural
group of points with similar features of interest. From a perceptual point of view,
as previously discussed, we are interested in the partitioning of l-type regions,
as opposed to h-type and m-type regions; thus, we partition the cluster space
into three regions through the well-known C-means clustering algorithm [?]. The
results presented in Fig. 5 show that, as expected, most of the h- and m-type
points are encoded in the luminance channel; chromatic information contributes
mostly to m-type points and little to h-type points.
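The clustering of the feature vectors into three classes can be sketched with a small hard C-means (k-means) loop in Python/NumPy. The details are assumptions made for illustration: the deterministic initialization at the lowest, median and highest total entropy production, the iteration count, the ranking of clusters by their centers, and the synthetic input stack are all hypothetical.

```python
import numpy as np

def kmeans3(features, n_iter=50):
    """Hard C-means with three clusters and a deterministic initialization."""
    order = np.argsort(features.sum(axis=1))
    centers = features[order[[0, len(order) // 2, -1]]].astype(float)
    for _ in range(n_iter):
        # Assign every feature vector to its nearest center.
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its members.
        for c in range(3):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels, centers

def label_regions(sigma_stack):
    """sigma_stack: (K, H, W) samples of the density at scales t_1..t_K.

    Returns an (H, W) map with 0 = l-type, 1 = m-type, 2 = h-type.
    """
    K, H, W = sigma_stack.shape
    v = sigma_stack.reshape(K, -1).T           # one K-dimensional vector per pixel
    labels, centers = kmeans3(v)
    order = np.argsort(centers.sum(axis=1))    # rank clusters by total production
    rank = np.empty(3, dtype=int)
    rank[order] = np.arange(3)
    return rank[labels].reshape(H, W)

# Hypothetical stack: three horizontal bands with low, medium and high values,
# decaying over K = 4 sampled scales.
base = np.repeat(np.array([0.0, 1.0, 10.0]), 2)
stack = np.stack([np.tile(base[:, None], (1, 6)) * s for s in (1.0, 0.8, 0.5, 0.2)])
regions = label_regions(stack)
print(regions)
```

On real data the feature vectors would come from sampling the spatio-chromatic density during the diffusion, and the three labels would correspond to the l-, m- and h-type regions discussed above.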




Fig. 5. Region identification for image “Waterworld”. The left column displays results
on the luminance channel: the top image presents h-type regions, whereas the
bottom image presents m-type regions. The two images in the right column are the 
corresponding results on the opponent channels. 



4 Discussion and Conclusion 

In this note a method has been presented to measure the local and global information
contained in a single image. It must be noted that the results depend
not just on the image but also on the specific type of irreversible transformation
the image undergoes. Thus, in anisotropic diffusion the rate of change of information
across scales does not depend, as in the isotropic case, solely on entropy
production; due to the characteristics of the process, the loss of information is, 
at least, partially prevented by a term that depends on the degree of parallelism 
between the gradient of the image at scale t and that of the image representing 



On the Representation of Visual Information 185 



the fixed point of the anisotropic diffusion process. An extension of the frame- 
work to color or vector-valued images has been presented. We have proposed an 
evolution equation based on the assumption that a multi-valued image is a com- 
plex isolated system, whose components, namely, the color components, interact 
with each others through a generalized thermodynamical force. 

Note that the model proposed here for measuring pattern information is different
from the classical model (à la Shannon), which assumes the pattern itself
as a message source. In our case, the message source is represented by the whole
dynamical system, namely the pattern together with the chosen transformation.
Finally, the proposed framework provides a way of representing and computing
image features encapsulated within different regions of scale-space. More
precisely, the method derives features as those corresponding to “entropy-rich”
image components. These constitute a preliminary representation towards the
computation of shape and texture.

Abstract. Graph pyramids make it possible to combine pruning of skeletons with a
concept known from the representation of line images, i.e. generalization
of paths without branchings by single edges. Pruning enables further
generalization of paths, and the latter speeds up the former. Within the
unified framework of graph pyramids a new hierarchical representation
of shape is proposed that comprises the skeleton pyramid, as proposed
by Ogniewicz. In particular, the skeleton pyramid can be computed in
parallel from any distance map.



1 Introduction 

A major goal of skeletonization consists in bridging the gap between low level
raster-oriented shape analysis and a semantic object description. In
order to create a basis for the semantic description, the medial axis
is often transformed into a plane graph [Ogn94]. This task has been solved using
the Voronoi diagram defined by the boundary points of a shape [Ogn94]
or by the use of special metrics on derived grids [Ber84]. In this paper we
propose a method that is confined neither to a special metric (distance map)
on a special grid nor to a special irregular structure like the Voronoi diagram.

The new method starts with a regular or irregular neighborhood graph. The
neighborhood graph reflects the arrangement of the sample points in the plane.
The vertices of the neighborhood graph represent the sample points and the
distances from the sample points to the boundary of the shape are stored in the
vertex attributes. The edges of the neighborhood graph represent the neighborhood
relations of the sample points. All illustrations in this paper refer to the
regular neighborhood graph, in which the sample points represent pixel centers
and the edges indicate the 4-connectivity of the pixels (Fig. 2b). The vertex
attributes reflect the Euclidean distance map (EDM) on a 4-connected set S of
pixels: each pixel p of S is equipped with the Euclidean distance between the
center of p and the closest pixel center outside of S [SL98] (Fig. 1b).

* This work has been supported by the Austrian Science Fund (FWF) under grant

The dual of the neighborhood graph is referred to as crack graph. The edges
of the crack graph describe the borders of the pixels. Each edge $e$ of the
neighborhood graph perpendicularly intersects exactly one edge $\bar{e}$ of the crack graph.
The edge $\bar{e}$ is called the dual of $e$ and vice versa. Each vertex of the crack graph
stands for a pixel corner (Fig. 2b).

Dual graph contraction (DGC) [Kro95] is used to successively generalize the
neighborhood graph by the removal and the contraction of edges. One level of
the resulting graph pyramid will be called skeleton graph (Fig. 5a). This term
is justified by the fact that all centers of maximal disks (with respect to the
distance map) are represented by vertices of the skeleton graph. Furthermore,
the skeleton graph is always connected.



[Figure 1 legend: inner pixel; boundary pixel around center of maximal disc; other boundary pixel. Panels (a) and (b).]

Fig. 1. (a) 4-connected pixel set. (b) Squared distances of (a).



This paper is organized as follows: Section 2 is devoted to the initialization of
the attributes in the neighborhood graph and in the crack graph. In Section 3 the
crack graph is contracted. Regarding the neighborhood graph this amounts to
the deletion of edges that are dual to the ones contracted in the crack graph. The
reduced neighborhood graph is called extended skeleton graph (since it contains
the skeleton graph). In Section 4 the extended skeleton graph is contracted to
the skeleton graph. An overview of the different graphs and their relations is
given in Fig. 2a.
Fig. 2. (a) Overview of the graphs and their relations. (b) Neighborhood graph (○)
and crack graph (□) restricted to the sub-window in Fig. 1. The numbers indicate the
attribute values of the vertices and the edges.



Like the skeleton, the skeleton graph is not robust. In Section 5 we propose
a pruning and generalization method for the skeleton graph. It is based on DGC
and yields the new shape representation by means of a graph pyramid. The
pyramid proposed by Ogniewicz can be obtained from the new representation by
threshold operations. We conclude in Section 6.

2 Initialization of the Neighborhood Graph and the 
Crack Graph 

The neighborhood graph may be interpreted as a digital elevation model (DEM)
if the vertex attributes, i.e. the distances of the corresponding sampling points
to the border of the shape, are interpreted as altitudes. Intuitively, the plan
for the construction of the skeleton graph is to reduce the neighborhood graph
such that the remaining edges describe the connections of the summits in the
DEM via the crest lines of the DEM. In contrast to [KD94, NGC92] our concept
is a dual one: the neighborhood relations of the basins are described by the
dual of the skeleton graph. In the next two sections it will turn out that the
reduction of the skeleton graph depends only on the order of the values from
the distance transform. Hence, we may use squared distances and thus avoid
non-integer numbers. The idea for the reduction of the neighborhood graph is to
remove edges that do not belong to ridges, thus forming the basins represented
by the dual graph. The following initialization (Fig. 2b) makes it possible to control this
process. The first part refers to the neighborhood graph, the second to the crack
graph (Fig. 2b):

— Let $\mathrm{dist}^2(v)$ denote the squared distance of the pixel that corresponds to
vertex $v$. The attribute value of $v$ is set to $\mathrm{dist}^2(v)$. The attribute value of
edge $e = (u, v)$ is set to the minimum of $\mathrm{dist}^2(u)$ and $\mathrm{dist}^2(v)$.

— The attribute value of edge $\bar{e}$ is set to the attribute value of edge $e$, where $e$
denotes the edge in the neighborhood graph that is dual to $\bar{e}$. The attribute
value of vertex $\bar{v}$ is set to the minimum of the attribute values of all edges
incident to $\bar{v}$.
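The two initialization rules can be sketched for the regular 4-connected case as follows. This is only an illustrative sketch: it assumes a rectangular window given as a 2-D list of squared distances (background pixels carrying 0), identifies each crack-graph vertex with a pixel corner at coordinates `(row, col)` of the pixel whose top-left corner it is, and omits the crack edges on the window border. The function name `init_attributes` is hypothetical.

```python
def init_attributes(dist2):
    """Initialise neighbourhood-graph and crack-graph attributes.

    dist2: 2-D list of squared distances (assumed rectangular window).
    Returns four dicts: vertex and edge attributes of the neighbourhood
    graph, then edge and vertex attributes of the crack graph.
    """
    rows, cols = len(dist2), len(dist2[0])
    vertex_attr = {(r, c): dist2[r][c] for r in range(rows) for c in range(cols)}
    edge_attr, crack_edge_attr = {}, {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbour edge, crossed by a vertical crack
                val = min(dist2[r][c], dist2[r][c + 1])
                edge_attr[((r, c), (r, c + 1))] = val
                crack_edge_attr[((r, c + 1), (r + 1, c + 1))] = val
            if r + 1 < rows:  # vertical neighbour edge, crossed by a horizontal crack
                val = min(dist2[r][c], dist2[r + 1][c])
                edge_attr[((r, c), (r + 1, c))] = val
                crack_edge_attr[((r + 1, c), (r + 1, c + 1))] = val
    # A crack vertex (pixel corner) gets the minimum over its incident crack edges.
    crack_vertex_attr = {}
    for (a, b), val in crack_edge_attr.items():
        for corner in (a, b):
            crack_vertex_attr[corner] = min(val, crack_vertex_attr.get(corner, val))
    return vertex_attr, edge_attr, crack_edge_attr, crack_vertex_attr
```

Storing the dual edge under the pair of corners it connects keeps the one-to-one correspondence between an edge and its dual explicit, which is what the contraction in the next section relies on.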



3 Contracting the Crack Graph 

Recall that the contraction of an edge in the crack graph is associated with the
removal of the corresponding dual edge in the neighborhood graph [HK99a]. The
neighborhood graph can never become disconnected: the removal of an edge $e$
would disrupt the neighborhood graph only if the corresponding dual edge $\bar{e}$
in the crack graph was a self-loop. DGC, however, forbids the contraction of
self-loops.

In order to get an intuitive understanding of the duality between contraction
and deletion, we focus on the embedding of graphs on the plane (only planar
graphs can be embedded on the plane) [TS92]. An embedding of the neighborhood
graph on the plane divides the plane into regions (Fig. 3a). Note that the




Fig. 3. Duality of edge deletion in a plane graph (a)→(b) and edge contraction in the
dual of the plane graph (c)→(d). The regions $r_1, \ldots, r_4$ in (a) are represented by the
vertices $\bar{r}_1, \ldots, \bar{r}_4$ in (c).



removal of an edge in the neighborhood graph is equivalent to the fusion of the
regions on both sides of the edge. In terms of watersheds [MR98] it is intuitive
to fuse the regions of the neighborhood graph until each of the resulting regions
corresponds to exactly one basin of the landscape. Two neighboring regions
may be fused if there is no separating ridge between the regions. Due
to the initialization of the attribute values in the crack graph, we may formulate
a criterion for the fusion of two regions as follows: Let $r_1$ and $r_2$ denote
two regions of the neighborhood graph and let $\bar{r}_1$ and $\bar{r}_2$ denote the corresponding
vertices in the crack graph. The regions $r_1$ and $r_2$ ($r_1 \neq r_2$) may be fused
if there exists an edge $\bar{e}$ between $\bar{r}_1$ and $\bar{r}_2$ whose attribute value equals the
attribute value of $\bar{r}_1$ or the attribute value of $\bar{r}_2$. Assume that the attribute value
of $\bar{r}_1$ is smaller than or equal to the attribute value of $\bar{r}_2$. Then the fusion of $r_1$ and
$r_2$ is achieved by the contraction of $\bar{r}_2$ into $\bar{r}_1$ (Fig. 3c). Thus, during the whole
contraction process the attribute value of a vertex in the crack graph indicates
the altitude of the deepest point in the region represented by the vertex.

Multiple fusions can be done by iterating the following parallel steps: 

1. For each edge of the crack graph that meets the above criterion for contraction,
mark the end vertex with the minimal attribute value. In case of
equality choose one of the end vertices by a random process.

2. Form a maximal independent set (MIS) of the marked vertices.
The MIS is a maximal subset of the marked vertices, no two
elements of which are connected by an edge.

3. Contract all edges that are incident to a vertex $v$ of the MIS and that meet
the above criterion for contraction ($v$ being the end vertex with the minimal
attribute).

The iteration stops when none of the edges in the crack graph meets the above
criterion. The resulting graph is called extended skeleton graph (Fig. 5a). It is
connected and it still contains all vertices of the neighborhood graph.
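Step 2 of the iteration above asks for a maximal independent set of the marked vertices. The sketch below uses a simple sequential greedy selection purely for illustration; it is not the parallel construction used in dual graph contraction, and the function name `greedy_mis` and the sorted visiting order are assumptions made here.

```python
def greedy_mis(marked, edges):
    """Return a maximal independent set of `marked` with respect to `edges`.

    marked: iterable of comparable vertex identifiers.
    edges:  iterable of (a, b) pairs; edges touching unmarked vertices
            and self-loops are ignored.
    """
    neighbours = {v: set() for v in marked}
    for a, b in edges:
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    chosen = set()
    for v in sorted(neighbours):
        # v may join the set only if none of its neighbours was chosen before.
        if neighbours[v].isdisjoint(chosen):
            chosen.add(v)
    return chosen
```

Because every marked vertex is either chosen or adjacent to a chosen vertex, the result cannot be extended, which is exactly the maximality required in step 2.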

4 Contracting the Neighborhood Graph 

In this section the extended skeleton graph is further reduced to the so called 
skeleton graph. The skeleton graph 

— still must contain all vertices which represent maximal discs and 

— still must be connected. 

We focus on edges $e = (u, v)$ such that $v$ has degree 1 in the extended skeleton
graph. The idea is to contract $v$ into $u$ if we can tell by a local criterion that $v$
does not represent a center of a maximal disc. All edges whose degree-one
end vertex fulfills this criterion may then be contracted in parallel.

Consider an edge $e = (u, v)$ such that $v$ is the end vertex with degree one.
Using the notation of Section 2, i.e. $\mathrm{dist}(v)$ [$\mathrm{dist}^2(v)$] for the [squared] distance
of a vertex $v$, we formulate the following criterion: If

$$\mathrm{dist}(u) - \mathrm{dist}(v) = 1, \qquad (1)$$

the vertex $v$ does not represent a center of a maximal disc and $v$ may be
contracted into $u$.
The distances $\mathrm{dist}(u)$ and $\mathrm{dist}(v)$ in condition (1) are integers¹. This follows
from the equation

$$\mathrm{dist}^2(u) = (\mathrm{dist}(v) + 1)^2 = \mathrm{dist}^2(v) + 2\,\mathrm{dist}(v) + 1 \qquad (2)$$

and the fact that the squared distances are integers.
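Since condition (1) forces both distances to be integers, it can be tested on the squared distances alone. The following sketch uses an integer square root in place of a look-up table; the function name `may_contract` is a hypothetical choice, and the inputs are assumed to be non-negative integers.

```python
import math


def may_contract(dist2_u, dist2_v):
    """Check dist(u) - dist(v) == 1 using only the squared distances.

    The condition holds exactly when dist2_v is a perfect square r*r
    and dist2_u equals (r + 1)**2, as follows from Equation (2).
    """
    r = math.isqrt(dist2_v)
    return r * r == dist2_v and dist2_u == (r + 1) ** 2
```

For example, the pair of squared distances (4, 1) satisfies the condition, whereas (5, 4) does not, because the square root of 5 is not 3.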

In case of grids other than the square grid or in case of irregular samplings,
Equation (1) generalizes to

$$\mathrm{dist}(u) - \mathrm{dist}(v) = \| u - v \|_2, \qquad (3)$$

where $\| \cdot \|_2$ denotes the Euclidean length of $u - v$. In terms of the digital
elevation model of Section 2, $u$ is
upstream of $v$. Repeated contraction of edges in the extended skeleton graph
yields the skeleton graph (Fig. 5a).

5 A New Hierarchical Representation for Shapes 

The new hierarchy is built on top of the skeleton graph. Besides pruning we also
apply generalization of paths between branchings by single edges, a concept
known from the representation of line images.

In order to assess the prominence of an edge $e$ in the skeleton of a shape $S$,
Ogniewicz

1. defines a measure $m$ for boundary parts $b$ of $S$, i.e. the length of $b$,

2. for each edge $e$ determines the boundary part $b_e$ of $S$ associated with $e$ (this
is formulated within the concept of Voronoi Skeletons),

3. sets the prominence of $e$ to $m(b_e)$.

In our approach we also measure boundary parts by their lengths. However, we
associate the boundary parts with vertices and thus define prominence measures
for vertices. The initial prominence measure $\mathrm{prom}(v)$ of $v$ indicates the number
of boundary vertices (vertices representing boundary pixels) contracted into $v$,
including $v$ itself if $v$ is a boundary vertex (Fig. 5a). This can already be
accomplished during the contraction of the neighborhood graph (Section 4) by

1. setting $\mathrm{prom}(v)$ to 1 if $v$ is a boundary vertex, 0 otherwise (before the
contraction),

2. incrementing the prominence measure of $w$ by the prominence measure of $v$
if $v$ is contracted into $w$.

Prominence measures will only be calculated for vertices that do not belong to 
a cycle of the skeleton graph. In the following, the calculation of the prominence 
measures from the initial prominence measures is combined with the calculation 
of the skeleton pyramid. 
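The bookkeeping described above can be sketched together with the degree-one contraction of Section 4. The code is a sequential illustration of what the text performs in parallel, and it assumes a simple graph given as a dictionary of neighbour sets, squared integer distances per vertex, and a set of boundary vertices; the function name `contract_dead_ends` and these data structures are hypothetical.

```python
import math


def contract_dead_ends(adj, dist2, boundary):
    """Contract degree-one vertices that satisfy condition (1).

    adj:      dict vertex -> set of neighbours (not modified).
    dist2:    dict vertex -> squared distance (non-negative integer).
    boundary: set of boundary vertices (initial prominence 1).
    Returns the reduced adjacency and the accumulated prominence measures.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    prom = {v: (1 if v in boundary else 0) for v in adj}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) != 1:
                continue
            (u,) = adj[v]
            r = math.isqrt(dist2[v])
            # dist(u) - dist(v) == 1, tested on squared distances only
            if r * r == dist2[v] and dist2[u] == (r + 1) ** 2:
                prom[u] += prom.pop(v)  # u inherits the prominence of v
                adj[u].discard(v)
                del adj[v]
                changed = True
    return adj, prom
```

The sum of all prominence measures is unchanged by every contraction, so it always equals the number of boundary vertices of the input.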

Let the degree of a vertex $v$ in a graph be written as $\deg(v)$ and let $P$ denote a
maximal path without branchings in the skeleton graph, i.e. $P = (v_1, v_2, \ldots, v_n)$
such that

$v_i$ is connected to $v_{i+1}$ by an edge for all $1 \le i < n$ and
$\deg(v_i) = 2$ for all $1 < i < n$ and
$\deg(v_1) \neq 2$, $\deg(v_n) \neq 2$.

¹ Thus condition (1) may be checked using only the squared distances and a look-up
table.




Fig. 4. Concatenation of edges. (a) Before concatenation. (b) After concatenation.



Let $e_i = (u_i, v_i)$ be an edge of $P$ with $u_i \neq v_i$ and $\deg(v_i) = 2$ (Fig. 4a). Since
$\deg(v_i) = 2$, there is a unique edge $e' \neq e_i$ in $P$ with $e' = (v_i, w_i)$, $w_i \neq v_i$. We
assume that $e_i$ does not belong to a cycle, i.e. $w_i \neq u_i$. If $\deg(w_i) \neq 1$, we allow
that $v_i$ may be contracted into $u_i$. The contraction of $e_i$ can be described by the
replacement of the two edges $e_i$ and $e'$ by a new edge $e^*$ (Fig. 4b).

The prominence measure of $u_i$ is updated by

$$\mathrm{prom}(u_i) := \mathrm{prom}(u_i) + \mathrm{prom}(v_i). \qquad (4)$$

This contraction process is referred to as concatenation. Due to the requirement
$\deg(w_i) \neq 1$ the prominence measures are successively collected at the vertices
with degree 1. The result of concatenation on the skeleton graph in Fig. 5a is
depicted in Fig. 5b.

After concatenation there are no vertices with degree 2. We focus on the set
of junctions with dead ends, i.e. the set $U$ of all vertices $u$ with

— $\deg(u) > 2$ and

— there exists an edge $e$ and a vertex $v$ with $e = (u, v)$ and $\deg(v) = 1$.

The set of all edges that connect a vertex $u \in U$ with a vertex $v$ of degree 1 is
denoted by $\mathit{Ends}(u)$. Note that for each edge $e$ there is at most one $u \in U$ with
$e \in \mathit{Ends}(u)$. For each $u \in U$ let $e_{\min}(u)$ denote an edge in $\mathit{Ends}(u)$ whose end
vertex with degree 1 has a minimal prominence measure.

The prominence measures of the vertices with degree 1 induce an order in
$\mathit{Ends}(u)$, $e_{\min}(u)$ being the least element. In case of $|\mathit{Ends}(u)| > 1$ we allow
$e_{\min}(u)$ to be contracted. Analogous to concatenation, the prominence measure
of $u$ is updated to $\mathrm{prom}(u) := \mathrm{prom}(u) + \mathrm{prom}(v)$.

In Fig. 6a the contraction is indicated by arrows: a white vertex at the source
of an arrow is contracted into the black vertex at the head of the arrow. The
new prominence measures are emphasized.
Fig. 5. (a) Extended skeleton graph. The vertices of the skeleton graph are given by
the filled circles. The numbers indicate the initial prominence measures > 1. (b) After
concatenation of the skeleton graph in (a). The numbers indicate the prominence
measures > 1.




Fig. 6. (a) Contraction of vertices with degree 1. The numbers indicate the prominence 
measures > 1. (b) Ranks: bold 1, medium 2, thin 3. 
For each set $\mathit{Ends}(u)$ the operation of contraction followed by updating the
prominence measure takes constant time. These operations can be performed
in parallel, since the sets $\mathit{Ends}(u)$, $u \in U$, are disjoint. Generalization of the
skeleton graph consists of iterating concatenation followed by contraction (both
include updating).
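The alternation of concatenation and contraction can be sketched sequentially as follows. The sketch rests on assumptions that are not part of the text: the skeleton graph is acyclic and simple, it is stored as a dictionary of neighbour sets with comparable vertex names, and ties between equally prominent dead ends are broken by vertex name. The function names are hypothetical.

```python
def concatenate(adj, prom):
    """Contract degree-2 vertices so that unbranched paths become single edges."""
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj or len(adj[v]) != 2:
                continue
            a, b = sorted(adj[v])
            # v is contracted into u only if the other neighbour w is not a leaf.
            if len(adj[b]) != 1:
                u, w = a, b
            elif len(adj[a]) != 1:
                u, w = b, a
            else:
                continue  # both neighbours are leaves: nothing to do
            prom[u] += prom.pop(v)
            adj[u].discard(v)
            adj[w].discard(v)
            adj[u].add(w)
            adj[w].add(u)
            del adj[v]
            changed = True


def contract_weakest_ends(adj, prom):
    """At each junction with more than one dead end, contract the weakest one."""
    contracted = []
    for u in list(adj):
        if u not in adj or len(adj[u]) <= 2:
            continue
        ends = [v for v in adj[u] if len(adj[v]) == 1]
        if len(ends) > 1:
            v = min(ends, key=lambda x: (prom[x], x))
            prom[u] += prom.pop(v)
            adj[u].discard(v)
            del adj[v]
            contracted.append((v, u))
    return contracted


def generalize(adj, prom):
    """Iterate concatenation and contraction until nothing can be contracted."""
    adj = {v: set(ns) for v, ns in adj.items()}
    prom = dict(prom)
    while True:
        concatenate(adj, prom)
        if not contract_weakest_ends(adj, prom):
            return adj, prom
```

Every contraction removes one vertex, so the loop terminates; the total prominence is preserved throughout, which gives a convenient sanity check.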

In [OK95] a hierarchy of skeleton branches is established by a skeleton traversal
algorithm. The traversal starts at the most prominent edge and follows the
two least steep descents (starting at each end vertex of the most prominent
edge). The highest rank skeleton branch consists of all edges that have been
traversed by this procedure. Skeleton branches of second highest rank originate
from the highest rank branch and also follow least steep descents (ignoring the edges
of the highest rank branch). Skeleton branches of third highest rank and so on
are defined analogously. The edges of the skeleton are labelled according to the
rank of the skeleton branch they belong to.

Analogous ranks can be determined by imposing a restriction on the generalization
described above: for each vertex $u \in U$ only one edge in $\mathit{Ends}(u)$ may
be contracted. This is achieved by initializing all vertices as vacant. Once an
edge from $\mathit{Ends}(u)$ has been contracted, $u$ is marked as occupied. The generalization
up to the state in which no further generalization can be done is referred to
as the first step. Thereafter, all vertices are marked as vacant again. The second
step is finished when occupation again forbids further generalization, and so on.
Thus, each edge of the concatenated skeleton graph that does not belong to a
cycle is contracted. If $n$ denotes the number of the last step, the rank of an edge
contracted in step $k$ ($1 \le k \le n$) is set to $2 + n - k$. Edges that belong to at
least one cycle of the extended skeleton graph receive rank 1.

The set of edges with rank smaller than or equal to $k$ ($1 \le k \le n + 1$) always
forms a connected graph. As in [OK95], these graphs can be derived by a simple
threshold operation on the concatenated skeleton graph according to the ranks
of the edges. The ranks of the extended skeleton graph in Fig. 5a are shown in
Fig. 6b.

6 Conclusion 

In this paper we have introduced a graph based hierarchical representation of
shapes that comprises the skeleton pyramid as proposed by Ogniewicz. The new
representation relies on the concept of graph pyramids by dual graph contraction.
It allows paths without branchings to be represented by single edges. This additional
hierarchical feature is suggested for the hierarchical matching of shapes by means
of their skeletons.

Abstract. A voxel-based method for flattening a surface while best preserving
the distances is presented. Triangulation or polyhedral approximation
of the voxel data is not required. The problem is divided into
two main subproblems: voxel-based calculation of the minimal geodesic
distances between the points on the surface, and finding a configuration
of points in 2-D that has Euclidean distances as close as possible to the
minimal geodesic distances. The method suggested combines an efficient
voxel-based hybrid distance estimation method, which takes the continuity
of the underlying surface into account, with classical multi-dimensional
scaling (MDS) for finding the 2-D point configuration. The proposed
algorithm is efficient, simple, and can be applied to surfaces that are not
functions. Experimental results are shown.



1 Introduction 

Surface flattening is the problem of mapping a surface in 3-D space into 2-D. 
Given a digital representation of a surface in 3-D, the goal is to map each 3-D 
point into 2-D such that the distance between each pair of points in 2-D is 
about the same as the corresponding geodesic distance between the points on 
the surface in 3-D space. 

It is known that mapping a surface in 3-D space into a 2-D plane introduces
metric distortions unless the surface and the plane have the same Gaussian curvature. For exam-
ple, a plane and a cylinder both have zero Gaussian curvature, since for the 
plane both principal curvatures vanish and for the cylinder one principal curva- 
ture vanishes. Therefore a plane bent into a cylindrical shape can obviously be 
flattened with no distortion. On the other hand, distortion-free flattening of a 
sphere onto the plane is impossible, because their Gaussian curvatures are dif- 
ferent. General surfaces are likely to have nonzero Gaussian curvatures, hence 
their flattening introduces some distortion. 

The need for surface flattening arises primarily in medical imaging (cortical
surface visualization and analysis) and in computer graphics. Following the first
flattening algorithms, more efficient methods were developed: some preserved
local metric properties in the flattening process, but large global distortions
sometimes occurred; others minimized a global energy functional; still others
used angle preserving mappings.
In computer graphics, surface flattening can facilitate feature-preserving texture
mapping. The 3-D surface is flattened, the 2-D texture image
is mapped onto the flattened surface and reverse mapping from 2-D to 3-D is
applied.

* Corresponding author.

In principle, the flattening process can be divided into two steps. First, the 
minimal geodesic distances between points on the surface are estimated. Then, 
a planar configuration of points that has Euclidean distances as close as possible 
to the corresponding minimal geodesic distances is determined. This paper com- 
bines a voxel-based geodesic distance estimator with an efficient dimensionality 
reduction algorithm to obtain a fast, practical method for surface flattening. The 
method presented is suitable for general surfaces in 3-D: the surface to be flat- 
tened does not have to be a function. A unique feature of our approach is that the 
algorithm operates directly on voxel data, hence an intermediate triangulated 
representation of the surface is not necessary. 

2 Voxel-Based Geodesic Distance Estimation 

Finding minimal geodesic distances between points on a continuous surface is 
a classical problem in differential geometry. However, in the context of digital 
3-D data, purely continuous differential methods are impractical due to the need 
for interpolation, their vast computational cost and the risk of convergence to a 
locally minimal solution rather than to the global solution. 

The common practice is to transform the voxel-based surface representation
into a triangulated surface representation (efficient triangulation methods
exist) prior to distance calculation. In [21] the distances between a
given source vertex on the surface and all other surface vertices are computed
in $O(n^2 \log n)$ time, where $n$ is the number of edges on the triangulated surface.
In [22] an algorithm that is simple to implement is given, but the algorithm runs
in exponential time. The fast marching method on triangulated
domains computes the minimal distances between a
given source vertex on the surface and all other surface vertices in $O(n \log n)$
time, where $n$ is the number of triangles that represent the surface. Texture
mapping via surface flattening with the fast marching method on triangulated
domains has also been presented.

An algorithm for geodesic distance estimation on surfaces in 3-D that uses
the voxel representation, without the need to first triangulate or interpolate the
surface, has been presented previously. The method is based on a high precision length
estimator for continuous 3-D curves that have been digitized and are given as
a 3-D chain code. The shortest path between two points on the surface is
associated with the path that has the shortest length estimate, and that length
estimate corresponds to the geodesic distance between the two points. Therefore
an algorithm that calculates length estimates between points on a path can be
combined with an algorithm to find shortest paths in graphs in order to find
minimal distances on the surface. The computational complexity of the method
is the same as that for finding shortest paths on sparse graphs, i.e. $O(n \log n)$,
where $n$ is the number of surface voxels.

Voxel-based, triangulation-free geodesic distance estimation is a central
component in the suggested surface flattening method. The digital surface is
viewed as a sparse graph. The vertices and edges of the graph respectively correspond
to voxels and to 3-D digital neighborhood relations between voxels. One of
three weights is assigned to each edge, depending on the digital
connection between the neighboring voxels: direct, minor diagonal or major
diagonal. The sparsity of the resulting weighted surface graph allows very efficient
search for shortest paths based on priority queues.
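A shortest-path search on such a weighted voxel graph can be sketched with Dijkstra's algorithm and a binary heap. The three edge weights used by the method are derived elsewhere and are not reproduced in this text, so the default values below (the Euclidean step lengths 1, √2 and √3) are placeholders chosen only for illustration; the function name `geodesic_distances` and the representation of the surface as a set of integer voxel coordinates are likewise assumptions.

```python
import heapq
import math


def geodesic_distances(voxels, source,
                       weights=(1.0, math.sqrt(2.0), math.sqrt(3.0))):
    """Dijkstra search over the 26-neighbourhood graph of surface voxels.

    voxels:  iterable of (x, y, z) integer tuples.
    source:  one of the voxels.
    weights: (direct, minor diagonal, major diagonal) edge weights;
             the defaults are placeholder values, not the published ones.
    """
    voxels = set(voxels)
    offsets = [(dx, dy, dz)
               for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
               if (dx, dy, dz) != (0, 0, 0)]
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist.get(p, math.inf):
            continue  # stale queue entry
        for dx, dy, dz in offsets:
            q = (p[0] + dx, p[1] + dy, p[2] + dz)
            if q not in voxels:
                continue
            # 1, 2 or 3 changed coordinates: direct, minor or major diagonal
            w = weights[abs(dx) + abs(dy) + abs(dz) - 1]
            nd = d + w
            if nd < dist.get(q, math.inf):
                dist[q] = nd
                heapq.heappush(heap, (nd, q))
    return dist
```

Running the search once per sampled voxel would fill one row of the distance matrix that serves as input to the scaling step.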

3 Flattening by Multidimensional Scaling 

A dissimilarity matrix is a matrix that stores measurements of dissimilarity
among pairs of objects. Multidimensional scaling (MDS) [3] is a common name
for a collection of data analytic techniques for finding a configuration of points
in low-dimensional Euclidean space that will represent the given dissimilarity
information by the interpoint Euclidean distances.

Since usually no configuration of points can precisely preserve the information,
an objective function is defined and the task is formulated as a minimization
problem. So given a set of items $\{\mathbf{x}_k\}$, $\mathbf{x}_k \in \mathbb{R}^r$, with dissimilarities
$\delta(k, l)$ between items $\mathbf{x}_k$ and $\mathbf{x}_l$, the goal is to find $p$-dimensional data vectors
$\{\tilde{\mathbf{x}}_k\}$, $\tilde{\mathbf{x}}_k \in \mathbb{R}^p$, $p < r$, that have Euclidean distances $\{d(k, l)\}$ that approximate
$\{\delta(k, l)\}$ well.

Many variants of MDS exist, differing in the objective function and opti- 
mization algorithms. These can be divided into two basic classes: metric and 
non-metric MDS. In metric MDS the original distance (or dissimilarity) matrix 
is approximated. Non-metric MDS deals with ordinal data or data in which only 
the order of the distances (dissimilarities) needs to be preserved. 

MDS is used as the second step in our two-step surface flattening approach. 
Once the minimal geodesic distances between points on the surface are esti- 
mated, they are represented as a matrix that serves as a dissimilarity matrix. 
The flattened 2-D point configuration that best preserves these distances is then 
obtained via a direct and simple metric MDS method, known as Classical Scal- 
ing. It provides an analytic solution, requires no iterations and is fast to compute. 
Classical scaling minimizes an objective function known as Strain. 

Let $*$ denote the pointwise product of two matrices, i.e. $A * B = (a_{ij} b_{ij})$ if
$A = (a_{ij})$ and $B = (b_{ij})$. Given a dissimilarity matrix $\Delta$ (a symmetric matrix
with nonnegative elements and zeroes on the diagonal that, in our case, contains
the estimated geodesic distances between the surface points), the Strain
minimization problem is

$$\min_X \| D(X) * D(X) - \Delta * \Delta \|_F^2$$

where $X$ is the $n \times 2$ matrix of point coordinates in $\mathbb{R}^2$, $D(X)$ is an $n \times n$ matrix
of Euclidean distances between the points in $X$, and $\| \cdot \|_F$ denotes the Frobenius
norm, i.e. $\|A\|_F^2 = \mathrm{trace}(A'A)$.

The double centering operator is defined as:

$$\tau(A) = -\tfrac{1}{2} J A J.$$

Here $A$ is a square $n \times n$ matrix, $J = I - \tfrac{1}{n} e e'$, $e = (1, \ldots, 1)'$ is an $n$-vector of
ones and $I$ is the $n \times n$ identity matrix. The double centering operator is a linear
operator on square matrices. Note that $\tau(A)$ is symmetric if $A$ is symmetric.

The classical scaling algorithm consists of the following steps:

— Apply double centering to $\Delta * \Delta$: $B = \tau(\Delta * \Delta)$.

— Compute the two ($p = 2$) largest eigenvalues $(\lambda_1, \lambda_2)$ and their corresponding
eigenvectors $[v_1 | v_2]$ of $B$. Create $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2)$ and $Q = [v_1 | v_2]$.

— The output coordinate matrix is given by $X = Q \Lambda^{1/2}$.
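The three steps can be sketched in plain Python as follows. The eigenpairs are obtained here by basic power iteration with deflation, which is only one possible choice and is assumed to be adequate when the two leading eigenvalues are positive and well separated; it is not claimed to be the solver used for the experiments. All function names, the fixed iteration count and the seeded random start vector are hypothetical.

```python
import math
import random


def double_center(M):
    """Return -1/2 * J M J with J = I - (1/n) e e'."""
    n = len(M)
    row = [sum(r) / n for r in M]
    col = [sum(M[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[-0.5 * (M[i][j] - row[i] - col[j] + tot) for j in range(n)]
            for i in range(n)]


def top_eigenpair(B, iters=1000, seed=0):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    n = len(B)
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    vv = sum(x * x for x in v)
    lam = sum(v[i] * Bv[i] for i in range(n)) / vv  # Rayleigh quotient
    scale = math.sqrt(vv)
    return lam, [x / scale for x in v]


def classical_mds(delta, p=2, iters=1000):
    """Classical scaling of an n x n dissimilarity matrix into p dimensions."""
    n = len(delta)
    squared = [[delta[i][j] ** 2 for j in range(n)] for i in range(n)]
    B = double_center(squared)
    X = [[0.0] * p for _ in range(n)]
    for k in range(p):
        lam, v = top_eigenpair(B, iters)
        s = math.sqrt(max(lam, 0.0))  # guard against negative eigenvalues
        for i in range(n):
            X[i][k] = s * v[i]
        for i in range(n):  # deflation: remove the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return X
```

When the dissimilarities are exact Euclidean distances of a planar point set, the recovered coordinates reproduce those distances up to rotation, reflection and translation.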

In this method, partial eigendecomposition of an n x n symmetric matrix 
is required. For complete eigendecomposition, the symmetric QR algorithm is 
usually used, requiring $O(n^3)$ time. Since only the two largest eigenvalues and
their corresponding eigenvectors are required, much faster algorithms can be 
applied, such as bisection, the power method and Rayleigh quotient iteration 
and orthogonalization with Ritz acceleration. See PI (section 7.8) and P2| for 
details. 

4 Interpolation 

Surfaces in 3-D space are often represented by tens of thousands of voxels. Al- 
though the interpoint geodesic distances between the surface voxels can be com- 
puted quite efficiently, applying the MDS algorithm to such a large amount of 
data may be inconvenient. We therefore first select a sample of surface voxels 
(say 1000) and apply the flattening procedure to this subset. Note that the min- 
imal distance between each pair of points in this sample is still calculated using 
the complete surface model. After flattening the sample, the flattened position 
of the remaining surface voxels is obtained by interpolation. Many different
interpolation algorithms can be employed; radial function interpolation [9, 11] was
used in this research.

The quality of the interpolation depends on the curvature of the surface and 
on the sampling method. Sampling adapted to the curvature of the surface, or 
uniform sampling of the surface, will produce the best results, but producing 
such samples may not be easy. However, if the curvature of the surface is not too 
high, other sampling methods can still lead to good results. For example, when
dealing with depth images $z(x, y)$ of smooth surfaces, uniform sampling in the
$(x, y)$ plane is usually adequate.

5 Experimental Results 

Fig. □ shows a synthetically created surface in 3-D space. The surface can be 
regarded as a rolled rectangular planar sheet of paper, with a uniform chessboard 
pattern. Note that, due to the rolling, this surface is not a function. 

Since the surface is a rolled plane it can, ideally, be flattened without any 
distortion. Applying the algorithm suggested in this paper to the surface yielded 
the flattened result shown in Fig. 0 The slight distortion is due to the small 
errors in the estimation of geodesic distances that are used as input to the MDS 
procedure. 

Fig. 0shows Euclidean distances on the flattened surface as a function of the 
corresponding estimated geodesic distances on the rolled surface in 3-D space. 
Since the original surface can ideally be flattened without distortion, had error- 
free geodesic distance estimates been available, all points in the graph would 
have been on the diagonal line. However, due to the slight errors in geodesic 
distance estimation, distances on the flattened surface cannot be identical to the 
geodesic distance estimates, leading to the small spread of points near the line. 

One application of the suggested surface flattening algorithm is texture mapping. 
Fig. 4 is a depth image of a human face, in which depth is represented as 
brightness. Following flattening, Fig. 5 shows Euclidean distances on the 
flattened surface as a function of the corresponding estimated geodesic distances 
on the surface of the 3-D model. The larger scattering around the diagonal line 
in this case is due to the fact that this surface cannot be flattened without 
some distortion. Yet, the distortion is quite small. 

Fig. 6 (left) shows a brick wall texture. Overlaying this texture onto the 
flattened face surface and mapping the surface back to 3-D with its texture 
(using the known transformation) yields the texture-mapped face shown in 
Fig. 6 (right). 
Fig. 1. A synthetic surface in 3-D space obtained by rolling a planar rectangular sheet. 
Note that this surface is not a function. 




Fig. 2. The flattened surface obtained using the method presented in this paper. 




Fig. 3. Euclidean distances on the flattened surface vs. the corresponding estimated 
geodesic distances on the 3-D surface. 
Fig. 4. A depth image of a human face. Underlying 3-D model courtesy of NRC, 
National Research Council of Canada, Institute for Information Technology, Ottawa, 
Canada K1A 0R6, 1997. 




Fig. 5. Euclidean distance on flattened surface vs. estimated geodesic distance on 3D 
surface of the face. 
Fig. 6. Left: A brick-wall texture. Right: The brick texture mapped onto the face. 
Abstract. In this paper we propose a novel image interpolation algo- 
rithm based on the quadratic B-spline basis function. Our interpolation 
algorithm preserves the original edges while not destroying the smooth- 
ness in flat area using the adaptive interpolation method according to the 
directional edge pattern of input image, significantly improving the over- 
all performance of the interpolation. Our experimental result shows that 
it can produce higher quality and resolution than the currently existing 
image interpolation methods. 



1 Introduction 

Image interpolation determines the intermediate values of a continuous function 
from given discrete samples. It is a technique for converting from one sampling 
rate to another [?]. Image interpolation can be used in the correction of spatial 
image distortion and in image zooming systems [2]. Image interpolation is often 
divided into two sub-processes: signal reconstruction and sampling. The former 
creates a continuous function from the discrete image data, and the latter samples 
this function to create a new, resampled image. If the sampling frequency of the 
input signal satisfies the Nyquist sampling condition and the signal is 
band-limited, the input signal can be perfectly reconstructed using the ideal 
interpolation function, the sinc function. However, since the sinc function is 
not time limited, it cannot be implemented in hardware. Therefore, we need to 
design a proper interpolator that can be implemented in hardware yet stays close 
to the sinc function, that is, to the ideal low-pass filter. The simple image 
interpolation methods comprise zero-order interpolation (nearest-neighbor 
interpolation) and first-order interpolation (bilinear interpolation). Both 
methods can be realized cost-effectively in consumer electronics devices. These 
days they are not used, since they may degrade the quality of the interpolated 
image through blocky artifacts and excessive smoothing, and more advanced 
interpolation schemes can be implemented thanks to the rapid advancement of VLSI. 
Hou and Andrews [?] proposed a more refined interpolation method, cubic B-spline 
interpolation. However, it is hardly used because of its large computational 
complexity. Keys proposed another interpolation technique, cubic convolution, 
which reduces the computational complexity of cubic B-spline interpolation. Unser 
presented a theoretical analysis of B-spline signal representation from a signal 
processing point of view, which is called the B-spline transform [2]. 
In this paper we propose a novel image interpolation algorithm based on 
the quadratic B-spline basis function. Our interpolation algorithm preserves the 
original edges while not destroying the smoothness in flat area using the adaptive 
interpolation method according to the directional edge pattern of input image, 
significantly improving the overall performance of the interpolation. Our exper- 
imental result shows that it can produce higher quality and resolution than the 
currently existing image interpolation methods. 

In section 2, the ideal interpolator and the conventional B-spline function 
are reviewed. In section 3, our interpolation algorithm is presented. In section 4, 
our experimental results are given in order to show that our method is superior 
to the other ones including Unser’s cardinal cubic spline interpolation method. 
Finally, in section 5, a conclusion is given. 



2 B-Spline Interpolation Methods 

2.1 Ideal Interpolation 

The ideal interpolator is a theoretical concept for explaining interpolators. Some 
fundamental properties of any interpolator can be derived from this ideal 
interpolation function. The ideal interpolator changes sign at the unit knot 
points, known as zero-crossing points. If an interpolator satisfies the following 
condition, it avoids smoothing and preserves the high-frequency components, 
i.e., edges. 

$$h(0) = 1, \qquad h(x) = 0, \quad |x| = 1, 2, \ldots \tag{1}$$

The ideal interpolator is spatially unlimited. There are two common approaches to 
obtaining spatially limited interpolators: truncation and windowing. Truncation 
of the ideal interpolator produces ringing effects, referred to as the Gibbs 
phenomenon, in the frequency domain because an amount of energy is discarded. 
Truncation alone is therefore not adequate for image interpolation. The other 
approach, windowing, is equivalent to multiplying the ideal interpolator by a 
window that is less severe than the rectangular function. With respect to the 
flatness of the pass-band frequency response, windowing is preferable to truncation. 
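
As a rough illustration of the two approaches, the Python sketch below builds a truncated and a windowed version of the sinc interpolator and checks that both still satisfy condition (1) at the knots. The Hann window, the half-width of 3 samples, and all function names are assumptions chosen for the example; the paper does not specify a particular window.

```python
import numpy as np

def truncated_sinc(x, half_width=3):
    """Ideal interpolator cut off abruptly (rectangular window)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < half_width, np.sinc(x), 0.0)

def hann_windowed_sinc(x, half_width=3):
    """Ideal interpolator tapered by a Hann window (an assumed, milder window)."""
    x = np.asarray(x, dtype=float)
    window = 0.5 * (1.0 + np.cos(np.pi * x / half_width))
    return np.where(np.abs(x) < half_width, np.sinc(x) * window, 0.0)

def interpolate(samples, x, kernel):
    """Evaluate sum_k samples[k] * kernel(x - k) at the positions x."""
    x = np.asarray(x, dtype=float)
    k = np.arange(len(samples))
    weights = kernel(x[..., None] - k)
    return (np.asarray(samples, dtype=float) * weights).sum(axis=-1)

samples = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
knots = np.arange(len(samples))

# Both kernels are 1 at x = 0 and 0 at the other integers, so the
# interpolated signal passes through the original samples.
at_knots_truncated = interpolate(samples, knots, truncated_sinc)
at_knots_windowed = interpolate(samples, knots, hann_windowed_sinc)
halfway = interpolate(samples, np.array([2.5]), hann_windowed_sinc)
print(at_knots_windowed, halfway)
```

Both kernels reproduce the samples exactly; they differ only in how they behave between the knots and in their frequency response.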

2.2 B-Spline Interpolations 

B-splines are a commonly used family of spline functions. B-splines of order $n$ 
are piecewise polynomial functions of degree $n$. These functions are 
differentiable $n-1$ times, i.e., they have derivatives up to order $n-1$. Any 
continuous piecewise polynomial function of degree $n$ which is also 
differentiable $n-1$ times can be represented using B-spline functions of the same 
order. In the case of uniform spacing between knot points, such a function can be 
represented in the form [?]: 

$$f^n(x) = \sum_{k=-\infty}^{+\infty} c_n(k)\,\beta^n(x - k) \tag{2}$$



An Adaptive Image Interpolation 207 








Fig. 1. Convolution property of B-spline weight functions: (a) Zero order (b) first order 
(c) quadratic (d) cubic 



where $\beta^n(x)$ is the $n$-th order B-spline weight function and $n$ is the 
degree of the piecewise polynomials that are connected at the knot points. The 
function is uniquely determined by its B-spline coefficients $c_n(k)$. 

The properties of B-splines can be derived by repeated self-convolution of a 
so-called basis function. If we define $\beta^0(x)$ as 

$$\beta^0(x) = \begin{cases} 1, & -\tfrac{1}{2} < x < \tfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$



then B-spline weight functions of any order satisfy the convolution property: 

$$\beta^n(x) = \beta^0(x) * \beta^{n-1}(x) \tag{4}$$

as illustrated by Figure 1(a)-(d). 

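In fact, the first order B-spline can be considered as the result of convolving 
the rectangular zero order B-spline with itself. For $n = 2$, we obtain the 
quadratic B-spline function defined by the following equation: 

$$\beta^2(x) = \begin{cases} \tfrac{3}{4} - |x|^2, & |x| \le \tfrac{1}{2} \\ \tfrac{1}{2}\left(|x| - \tfrac{3}{2}\right)^2, & \tfrac{1}{2} < |x| \le \tfrac{3}{2} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$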

Quadratic B-spline functions have been disregarded because they are known 
to be space variant and to introduce phase distortion in the output signal. 
However, Dodgson showed that this is not generally the case and recently derived 
a family of quadratic interpolators that is better behaved [?]. Also, Toraichi 
proposed another family of quadratic interpolators that has linear phase [?]. 
For $n = 3$, we obtain the cubic B-spline function: 

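$$\beta^3(x) = \begin{cases} \tfrac{1}{2}|x|^3 - |x|^2 + \tfrac{2}{3}, & |x| < 1 \\ \tfrac{1}{6}\left(2 - |x|\right)^3, & 1 \le |x| < 2 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$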

The cubic B-spline function is hardly used as an interpolation function, since it 
fails to satisfy the interpolation condition (1): $\beta^3(0) = 2/3$ and 
$\beta^3(\pm 1) = 1/6$. 
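
The values at the knots can be checked numerically. The sketch below is a direct Python transcription of the standard quadratic and cubic B-spline weight functions (the function names `beta2` and `beta3` are chosen here for illustration); it shows that neither function is 1 at the origin and 0 at the other integers, so neither satisfies condition (1) on its own.

```python
import numpy as np

def beta2(x):
    """Quadratic B-spline weight function (support |x| <= 3/2)."""
    x = np.abs(np.asarray(x, dtype=float))
    inner = 0.75 - x ** 2
    outer = 0.5 * (x - 1.5) ** 2
    return np.where(x <= 0.5, inner, np.where(x <= 1.5, outer, 0.0))

def beta3(x):
    """Cubic B-spline weight function (support |x| < 2)."""
    x = np.abs(np.asarray(x, dtype=float))
    inner = 2.0 / 3.0 - x ** 2 + 0.5 * x ** 3
    outer = (2.0 - x) ** 3 / 6.0
    return np.where(x < 1.0, inner, np.where(x < 2.0, outer, 0.0))

knots = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
quadratic_at_knots = beta2(knots)   # 0, 1/8, 6/8, 1/8, 0
cubic_at_knots = beta3(knots)       # 0, 1/6, 4/6, 1/6, 0
print(quadratic_at_knots, cubic_at_knots)
```

Because the knot values are non-zero away from the origin, using these functions directly as interpolation kernels averages neighbouring samples.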
Table 1. Transfer function of B-spline weight functions of order 0 to 3 

n                 | 0             | 1               | 2                                | 3
Transfer function | sinc(w/2π)    | sinc^2(w/2π)    | 4 sinc^3(w/2π) / (3 + cos(w))    | 3 sinc^4(w/2π) / (2 + cos(w))







Fig. 2. Log-magnitude plot of B-spline weight functions: (a) Zero order (b) first order 
(c) quadratic (d) cubic 



Table 1 shows the transfer functions of the B-spline weight functions of order 0 
to 3. Figure 2 shows the log-magnitude response plots of the B-spline weight 
functions. The transfer functions roll off too early, indicating that B-spline 
interpolation performs too much averaging. Increasing the order of the spline not 
only improves the quality of interpolation but also increases the smoothing effect. 
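
The growing amount of smoothing can be made concrete with a small numerical check. The sketch below relies on the standard result that the Fourier transform of the order-$n$ B-spline weight function is $\mathrm{sinc}^{n+1}(w/2\pi)$; the evaluation frequency $w = \pi/2$ and the function name are arbitrary choices for illustration.

```python
import numpy as np

def bspline_weight_response(w, order):
    """Magnitude response sinc^(order+1)(w / 2*pi) of the B-spline weight function.

    np.sinc(x) is sin(pi*x)/(pi*x), so np.sinc(w / (2*pi)) equals sin(w/2)/(w/2).
    """
    w = np.asarray(w, dtype=float)
    return np.abs(np.sinc(w / (2.0 * np.pi))) ** (order + 1)

# Gain at a frequency well inside the pass band, for orders 0 to 3.
w = np.pi / 2.0
gains = [float(bspline_weight_response(w, n)) for n in range(4)]
for n, gain in enumerate(gains):
    print(f"order {n}: gain {gain:.4f}")
```

The gain at a fixed pass-band frequency drops with every increase in order, which is the extra averaging noted above.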

3 Adaptive Image Interpolation Algorithm 

Our interpolation algorithm consists of two major steps: edge estimation and 
interpolation. In the edge estimation step, five types of edge are classified by 
investigating the direction of the edge. In the interpolation step, quadratic 
spline interpolation is performed by manipulating the interpolation coefficients 
adaptively according to the types of edges. 



3.1 Edge Estimation 

We now describe how to classify edges according to their directions. Using the 
four neighboring pixels around the center of the interpolation area, as shown in 
Figure 3, $d_1$ to $d_4$ are obtained by calculating the absolute differences 
between them as follows: 

$$d_1 = |x_1 - x_2|, \quad d_2 = |x_1 - x_3|, \quad d_3 = |x_3 - x_4|, \quad d_4 = |x_2 - x_4| \tag{7}$$






















[Figure: 2x2 pixel neighborhood with x1 and x2 in the top row and x3 and x4 in the bottom row] 













Fig. 3. Adjacent 4 pixels for determining the edge 




Fig. 4. Flow diagram of determined directional edges 

Let $d_{max}$ and $d_{smax}$ be the largest and second largest values among them, 
respectively. Then, five types of edge patterns are determined according to 
$d_{max}$ and $d_{smax}$ by the edge detection algorithm shown in Figure 4. The 
direction of the edge for each type follows directly, as illustrated in Figure 5. 
In the next subsection, we show how to exploit the edge directional pattern in the 
quadratic spline interpolation. 
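
The quantities used by the edge estimation step can be computed as follows. This Python sketch covers only equation (7) and the selection of the largest and second largest differences; the decision rules of the flow diagram in Figure 4 are not reproduced here, so the mapping to the five edge types is deliberately left out. The function name and the sample pixel values are hypothetical.

```python
def edge_differences(x1, x2, x3, x4):
    """Return d1..d4 of equation (7) and the two largest of them.

    Pixel layout assumed from Figure 3:
        x1 x2
        x3 x4
    """
    diffs = {
        "d1": abs(x1 - x2),   # top row, horizontal difference
        "d2": abs(x1 - x3),   # left column, vertical difference
        "d3": abs(x3 - x4),   # bottom row, horizontal difference
        "d4": abs(x2 - x4),   # right column, vertical difference
    }
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    largest, second_largest = ranked[0], ranked[1]
    return diffs, largest, second_largest

# Hypothetical neighborhood: dark left column, bright right column.
diffs, largest, second = edge_differences(10, 200, 12, 198)
print(diffs)
print("d_max:", largest, "d_smax:", second)
```

In this example both dominant differences are horizontal ones, which is what one would expect when an intensity step runs vertically between the two columns.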

3.2 Adaptive Quadratic Spline Interpolation Algorithm 

In this subsection we show that a quadratic spline interpolator is derived 
from a quadratic B-spline weight function by calculating the weighted sum of 
quadratic B-spline weight functions, and that it is a smooth piecewise polynomial 
of degree two with linear phase characteristics in the frequency domain. In the 
interpolation step, a continuous signal is reconstructed from its discrete samples 
as follows: 

$$f(x) = \sum_{k=-\infty}^{+\infty} c(k)\, q(x - k) \tag{8}$$




Fig. 5. 5 types of edge for proposed adaptive interpolation 
where $c(k)$ is a discrete-signal sample, and $q(x)$ is the continuous impulse 
response of the quadratic spline interpolator. Let us consider a discrete signal 
$f(k)$ on $k = -\infty, \ldots, +\infty$ defined as 

$$f(k) = \sum_{m=-\infty}^{+\infty} c(m)\, q(k - m) \tag{9}$$



If we assume that $q(x)$ is the quadratic B-spline weight function $\beta^2(x)$ 
given by equation (5), then $\beta^2(-1) = 1/8$, $\beta^2(0) = 6/8$, and 
$\beta^2(+1) = 1/8$, and we obtain 

$$f(k) = \frac{1}{8}\{c(k-1) + 6\,c(k) + c(k+1)\} \tag{10}$$

which can be described in convolution form 

$$f(k) = c(k) * q(k) \tag{11}$$

From equations (10) and (11), we can derive the following equation 

$$f(k) = c(k) * \frac{1}{8}\{\delta(k-1) + 6\,\delta(k) + \delta(k+1)\} \tag{12}$$

In the frequency domain, the Fourier transform of $f(k)$ is given by 

$$F(w) = C(w) \cdot \frac{1}{4}\{3 + \cos(w)\} \tag{13}$$

and hence the Fourier transform of $c(k)$ is 

$$C(w) = \frac{4\,F(w)}{3 + \cos(w)} \tag{14}$$

Then, the inverse Fourier transform of $C(w)$ is obtained as 

$$c(k) = f(k) * \sqrt{2}\,(2\sqrt{2} - 3)^{|k|} = \sqrt{2} \sum_{i=-\infty}^{+\infty} (2\sqrt{2} - 3)^{|k-i|}\, f(i) \tag{15}$$

Therefore, the continuous function $f(x)$ can be written as the weighted sum of 
the initial sample values $f(i)$ as follows: 

$$f(x) = \sum_{k=-\infty}^{+\infty} c(k)\, q(x - k) = \sum_{i=-\infty}^{+\infty} f(i)\, h_{quad}(x - i) \tag{16}$$

where the quadratic spline interpolator $h_{quad}(x)$ corresponding to the 
interpolation coefficient is determined by 

$$h_{quad}(x) = \sqrt{2} \sum_{k=-\infty}^{+\infty} (2\sqrt{2} - 3)^{|k|}\, \beta^2(x - k) \tag{17}$$
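
The derivation can be verified numerically. The following Python sketch evaluates the infinite sum for $h_{quad}(x)$ with a finite number of terms (an assumption that is harmless here because $|2\sqrt{2} - 3| \approx 0.17$ decays quickly), checks that the resulting kernel satisfies the interpolation condition (1), and checks that the sequence $\sqrt{2}\,(2\sqrt{2}-3)^{|k|}$ inverts the filter $\frac{1}{8}\{1, 6, 1\}$. The names `ALPHA`, `beta2` and `h_quad` are chosen for the example.

```python
import numpy as np

ALPHA = 2.0 * np.sqrt(2.0) - 3.0   # decay factor of the inverse filter

def beta2(x):
    """Quadratic B-spline weight function."""
    x = np.abs(np.asarray(x, dtype=float))
    inner = 0.75 - x ** 2
    outer = 0.5 * (x - 1.5) ** 2
    return np.where(x <= 0.5, inner, np.where(x <= 1.5, outer, 0.0))

def h_quad(x, terms=40):
    """Quadratic spline interpolator, truncating the sum to |k| <= terms."""
    x = np.asarray(x, dtype=float)
    k = np.arange(-terms, terms + 1)
    weights = np.sqrt(2.0) * ALPHA ** np.abs(k)
    return (weights * beta2(x[..., None] - k)).sum(axis=-1)

# Interpolation condition: 1 at the origin, 0 at the other knots.
knot_values = h_quad(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))

# The inverse filter convolved with (1/8){1, 6, 1} should give a unit impulse.
k = np.arange(-40, 41)
inverse_filter = np.sqrt(2.0) * ALPHA ** np.abs(k)
impulse = np.convolve(inverse_filter, np.array([1.0, 6.0, 1.0]) / 8.0)
print(knot_values, impulse[len(impulse) // 2])
```

Unlike the B-spline weight function itself, the resulting kernel passes exactly through the given samples, at the cost of an infinite (though rapidly decaying) support.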

Fig. 6. Quadratic spline interpolator and its magnitude response: (a) interpolator, (b) 
its linear magnitude response, (c) its Log-magnitude response 







Fig. 7. Block diagram of proposed algorithm 



Figure 6 illustrates the quadratic spline interpolator together with its linear 
magnitude and log-magnitude frequency responses. Comparison with the B-spline 
weight functions in Figure 2 shows that our quadratic spline interpolator has an 
excellent pass-band characteristic while satisfying the zero-crossing condition. 
Figure 7 shows the block diagram of our algorithm, which adaptively manipulates 
the interpolation coefficients according to the edge directional patterns. Our 
method consists of the following steps. First, edge detection is performed using 
the proposed edge detection algorithm, which classifies five types of edge. At 
the same time, the original image is stored in a line buffer for calculating the 
quadratic spline interpolator. In order to provide adaptive interpolation, a 
weighted value generator produces parameters which control the interpolation 
coefficients according to the five types of edge. If no edge is detected in the 
support area of the interpolation, interpolation is performed using the original 
coefficients; otherwise it uses new coefficients, which are converted from the 
original ones by the control parameters. 
4 Experimental Results 

The performance of the proposed interpolation is evaluated in terms of PSNR and 
subjective visual quality. Two types of images have been tested using the 
conventional interpolation methods and the proposed interpolation algorithm: a 
natural image (the 256x256, 8-bit Lena image) and a graphical text image (a 
256x256, 8-bit text image with higher contrast). 

4.1 PSNR Measure 

In order to have an objective evaluation of the proposed scheme, the test images 
are low-pass filtered by averaging over 4 pixels (2x2 mask) and decimated to half 
the number of rows and columns. Then, several interpolation methods with a scale 
factor of 2 are applied to reproduce the original image, including B-splines, 
cubic convolution, quadratic spline interpolation, cubic spline interpolation, and 
our proposed interpolation method. Table 2 lists the PSNR for each of the 
interpolation methods and shows that our proposed interpolation method is superior 
to the other interpolation methods. 
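
The evaluation protocol can be sketched in Python as follows. The 2x2 averaging and decimation follow the description above; the PSNR is computed with the usual definition for 8-bit images, $10 \log_{10}(255^2 / \mathrm{MSE})$, which is assumed here since the paper does not spell it out. Nearest-neighbor upsampling stands in for the interpolation method under test, and the random test image is hypothetical.

```python
import numpy as np

def decimate_2x2(image):
    """Average each 2x2 block, halving the number of rows and columns."""
    h, w = image.shape
    cropped = image[: h - h % 2, : w - w % 2].astype(float)
    blocks = cropped.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def upsample_nearest(image, factor=2):
    """Zero-order (nearest-neighbor) interpolation as a placeholder method."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    diff = reference.astype(float) - estimate.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

# Hypothetical 8-bit test image.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)

small = decimate_2x2(original)            # 32 x 32
restored = upsample_nearest(small, 2)     # back to 64 x 64
print("PSNR (dB):", psnr(original, restored))
```

Replacing `upsample_nearest` with another interpolator of the same signature would give one column of a comparison such as Table 2.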



Table 2. PSNR of each interpolation method (dB) 

Intp. method | Zero order B-spline | First order B-spline | Quad. B-spline | Cubic B-spline | Cubic conv. | Quad. intp. | Cubic intp. | Proposed intp.
Lena image   | 52.23               | 43.40                | 59.29          | 58.9           | 59.46       | 62.99       | 63.01       | 63.03
Text image   | 28.83               | 26.08                | 32.27          | 32.09          | 32.33       | 33.54       | 33.58       | 33.63



4.2 Image Interpolation Results 

Figure 8 shows the images obtained by applying several interpolation techniques 
to magnify the Lena image with a zooming factor of 8. The first two images are 
obtained by simple zero order B-spline (Figure 8(a)) and first order B-spline 
interpolation (Figure 8(b)), respectively. They show mosaic blocks (Figure 8(a)), 
which visually degrade the image, and image blurring (Figure 8(b)). Figure 8(c) 
and Figure 8(d) show the images obtained by higher order B-splines, that is, 
second order and third order B-splines, respectively. They are severely blurred. 
Therefore, we can see that increasing the order of the spline not only improves 
the quality of the interpolated image but also increases the smoothing effect. 
Figure 8(e) and (f) show the images obtained by Unser's and our algorithms, 
respectively. In both of them, the blocky effect is significantly reduced, while 
the fidelity to the original image is maintained, compared with the other 
interpolation methods. 

Figure 9 shows scaled-up images with scale factor 2 for the text image, obtained 
using various interpolation methods: first order B-spline, second order B-spline, 
Fig. 8. Magnified images produced by several interpolation methods with zoom factor 8: 
(a) Zero order, (b) First order, (c) Second order, (d) Third order, (e) Cardinal cubic 
spline, (f) Proposed interpolation 



cardinal cubic spline interpolation, and the proposed interpolation algorithm. 
Figure 9(a) is the original text image, which has 256 gray levels and a size of 
256 x 256 pixels. Figure 9(b) is the original text image after low-pass filtering 
and decimation to half the number of rows and columns. Then, several interpolation 
methods with a scale factor of 2 are applied to reproduce the original image. 
These decimation and interpolation processes are performed in order to calculate 
the PSNR. Images obtained by applying first order B-spline, second order B-spline, 
cardinal cubic spline interpolation, and our interpolation method to the image in 
Figure 9(b) are shown in Figure 9(c) through (f). The objective comparison of our 
method with the other ones, in terms of PSNR, is given in Table 2. It can be 
observed that our method achieves a better result than the other conventional 
methods. Compared to Unser's method, our method shows a larger improvement for 
text images than for natural images. 



5 Conclusion 

In this paper we have presented a new adaptive image interpolation algorithm 
based on the quadratic B-spline basis function. Our interpolation algorithm con- 
sists of 2 major steps: Edge estimation and interpolation. In the edge estimation 
step, five types of edge are classified by investigating the direction of the edge. In 
the interpolation step, interpolation is performed by manipulating interpolation 
Fig. 9. Scaled-up images with scale factor 2 using the various interpolation methods: 
(a) Original text image (b) Decimated image with scale down factor 2 from original 
image (c) First order B-spline (d) Second order B-spline (e) Cardinal cubic spline (f) 
Proposed interpolation 



coefficients corresponding to quadratic spline interpolator adaptively according 
to the types of edges obtained in the edge estimation step. We have shown that 
a quadratic spline interpolator is derived from a quadratic B-spline weight func- 
tion by calculating the weighted sum of quadratic B-spline weight functions, and 
its adaptation according to the edge patterns enables us to preserve the original 
edges while not destroying the smoothness in flat area, greatly enhancing the 
overall performance of the interpolation. In our experiments our method is 
compared with the previous ones for graphic and text images, and we have shown 
that it can produce higher quality and resolution than the currently existing 
image interpolation methods. Moreover, our method, which is close to the ideal 
low-pass filter characteristic, is now being implemented in hardware, since it 
requires less hardware than Unser's cardinal cubic spline interpolator. 
Abstract. The usefulness of the 3D Medial Axis (MA) is dependent 
on both the availability of accurate and stable methods for computing 
individual MA points and on schemes for deriving the local structure 
and connectivity among these points. We propose a framework which 
achieves both by combining the advantages of exact bisector computations 
used in computational geometry, on the one hand, and the local 
nature of propagation-based algorithms, on the other, but without the 
computational complexity, connectivity, added dimensionality, and 
post-processing issues commonly found in these approaches. Specifically, the 
notion of flow of shocks along the MA manifold is used to identify flow 
along special points and curves which define a shock scaffold. This 1D 
scaffold is of lower dimensional complexity than the typical geometric 
locus of medial points which are represented as 2D sheets. The scaffold 
not only organizes shape information in a hierarchical manner, but is a 
tool for the efficient recovery of the scaffold itself and can lead to exact 
reconstruction. We present examples of this approach for synthetic data, 
as well as for sherd data from the domain of digital archaeology. 

Keywords: 3D Medial Axis, 3D Skeletons, 3D Symmetry Sets, shock 
hypergraph, shape representation. 

1 Introduction 

The Medial Axis (MA) or skeleton representation [3] has shown great potential 
in object recognition, in solid modeling for designing and manipulating shapes, 
in organizing a cloud of points into surfaces, for mesh generation, path planning, 
numerical tool machining, animation, etc. However, for the MA to be useful in 
these applications, it must often first be organized in a graph structure which 
embeds not only the qualitative aspects of shapes, e.g., parts, in a hierarchy of 
scales, but also the more detailed quantitative features. Traditionally, algorithms 
for 3D skeleton computation have typically focused on deriving the geometric 
locus of skeletal surfaces, thus leaving unclear the local connectivity in the in- 
terior of each MA sheet as well as in the joints, where three or more sheets 
come together. Also, while the MA as a transformation from object to symme- 
try coordinates is useful in itself, it does not address the issue of data reduction, 
since it is not a priori clear how to summarize the 2D MA sheets into a lower 
dimensional structure. While the interesting notion of curve skeletons has been 
proposed earlier [4,29], this is not for free form shapes of arbitrary complexity. 
A key goal of this paper is to address both problems by proposing the notion of 
a shock scaffold, upon which the remaining parts of the MA can be constructed 
in a robust manner, and by developing an efficient computational scheme for 
obtaining this scaffold. 

Techniques developed to extract MA symmetries in 3D can be roughly organized 
into six main classes: (i) Thinning [18], (ii) Boundary modeling [30], (iii) 
Voronoi diagram [23,2,19,1], (iv) Distance Transform [9,21,31,4,5], (v) Surface 
evolution [22,24,10], (vi) Bisector computations and trimming in Computational 
Geometry [6,20]. We cannot review these here due to space limitations, but in 
some sense the ideal algorithm for the recovery of the 3D MA should combine 
the advantages of these approaches. Specifically, we seek a method which on 
the one hand features the exactness of bisector computations and Voronoi 
diagrams, but stripped of their tremendous computational burden, and on 
the other hand features the flow-based nature of Blum's grassfire [3,14], which 
underlies thinning, distance transforms, and surface evolutions, but without their 
connectivity, added dimension and post-processing issues. 

A key insight which unifies these approaches in this work and the earlier 2D 
version [28,25,27,26] is that the full bisector need not be considered if a 
flow-based approach is adopted. Specifically, if the initial sources of flow are 
completely classified, one may only compute bisectors which initiate from these 
and ignore all others completely, leading to substantial savings. This also 
immediately leads to a graph structure which captures local connectivity and 
exact results. 

A second key feature which is specific to 3D is the need for dimension 
reduction in computing the 3D MA, which is accomplished by employing the notion 
of a shock scaffold. The complete classification of the 3D MA points and the 3D 
shock points, i.e., MA points augmented with a sense of flow, was reported in 
[8]. Specifically, the MA points are formally classified into five types: one type 
corresponding to the interior of MA sheets, two types for MA curves, at the 
boundary of skeletal sheets, and two types for MA nodes, at the intersection 
of these boundaries. This classification, together with the notion of flow along 
the curves, is the basis of constructing the shock scaffold, a reduced-dimension 
summary of the MA. The proposed algorithm identifies points of propagation 
from initial shock sources, propagates along the shock scaffold, and computes 
intersections among the junctions of this scaffold structure, until this 
propagation computation terminates at shock sinks. 

The resulting shock scaffold is a graph structure consisting of nodes (isolated 
points) and links (curve segments). Together with hyperlinks (surface patches), 
the shock scaffold gives rise to the shock hypergraph which is a complete repre- 
sentation of shape. The scaffold essentially allows us to ignore or approximate 
the medial surface patch geometry, or even the medial curve geometry, while 
retaining the connectivity among nodes and links which proportionally contain 
the most significant aspects of the MA. The algorithm is generic for any initial 
shape geometry, whether described by a cloud of points as described in this pa- 
per, or as a collection of surface patches as will be described in future work. The 
advantages of this framework are the reduced dimensionality, the exactness of 
the results, the efficiency of the algorithm, its applicability to unsegmented and 
unorganized data, and the immediate availability of a graph structure which can 
be used in recognition and other applications. We illustrate the results for a set 
of synthetic examples as well as for sherds to be used for grouping fragments into 
reconstructed pots and other complex shapes found, in particular, for a project 
involving the archaeological site of Petra, Jordan [13]. 



2 The Shock Scaffold 



The classification of shock points described in [8] is based on the notion of contact 
with spheres, i.e., the loci of spheres osculating sources. Let $A_k^n$ denote a circle 
(in 2D) or a sphere (in 3D) osculating a boundary element at $n$ distinct points, 
each with $k + 1$ degree of contact, Figure 1: $k = 1$ denotes regular tangency; 
$k = 2$ denotes a sphere of curvature for a surface patch; $k = 3$ denotes a sphere 
of curvature at a ridge point; $k = 4$ denotes a sphere of curvature at a turning 
point of a ridge, etc. [11]. 





Figure 1. Illustration of the notation $A_k^n$ (from [8]). $k + 1$ counts order or 
degree of contact: $A_1$ is regular tangent contact, $A_2$ is regular "curvature" 
contact, $A_3$ is a curvature maximum contact. The superscript $n$ counts the number 
of contact points, so that $A_1^2$ means two $A_1$ contacts. A similar 
definition holds for the contact of surfaces with spheres. 



Only odd orders of contact (i.e., $k = 1, 3$) can contribute to a MA type of 
shock, that is, as being the center of a maximal sphere, $S$. A classification based 
on the number and order of contact [8] leads to five principal types of shock 
points: $A_1^2$, $A_1^3$, $A_3$, $A_1^4$ and $A_1A_3$: (i) $A_1^2$ contact: a sphere 
with ordinary $A_1$ contact at two source points generates a shock sheet point. The 
centers of such spheres sweep out a piece of the Symmetry Set (SS) which is locally 
smooth. (ii) $A_3$ contact: this is the limiting case of two $A_1$ points, which 
corresponds in 2D to curvature extrema and in 3D to ridges on the boundary. 
(iii) $A_1^3$ contact: the sphere, $S$, has triple tangency on the bounding surface 
elements, $M$. Choose any 2 of these 3 tangency points and move the sphere so that 
it remains bitangent to $M$ at points close to these two. This results in a smooth 
sheet of the SS or MA for each pair, leading to a total of three such smooth sheets 
passing through the center of $S$. (iv) $A_1A_3$ contact: it contains the centers of 
spheres which have contact with the surface in two places, one near the original 
$A_1$ point (i.e., ordinary tangency) and one near the ridge point $A_3$. 
(v) $A_1^4$ contact: the sphere is tangent to 4 source points - this is generic. 
Through the center of the sphere pass 6 smooth sheets of the SS (i.e., 6 pairs from 
4 source points), two of which are not manifested in the MA, leading to an 
intersection of four MA sheets. An alternative view of this event is as the 
combination/intersection of four axial $A_1^3$ curves. 

Two observations are significant here. First, the topology of each of these 
types is as follows: $A_1^2$ points are interior points of a medial surface; $A_3$ 
points organize into curves representing ridges on surfaces and are the "exterior" 
boundary of medial surface sheets; $A_1^3$ points organize into curves which are the 
intersection of three medial surface sheets; these curves often correspond to a 
"generalized axis" as well as to the "interior" boundary of MA sheets; $A_1^4$ and 
$A_1A_3$ are isolated points where four $A_1^3$ curves or a pair of $A_1^3$ and 
$A_3$ curves intersect, respectively. 




Figure 2. [...] points (b). 

Second, one can construct a notion of flow for each MA point in the direction 
of increasing radius, which leads to a further subclassification of points as was 
done in [12] in 2D. Shocks, i.e., MA points endowed with a sense of flow, can 
flow along sheets ($A_1^2$) or curves ($A_3$ and $A_1^3$) in various ways: they can 
flow monotonically (1st order), can act as a source and initiate flow (2nd order), 
or can act as a sink and terminate flow (4th order), Figure 2a. Third-order shocks 
represent infinitely "fast" flows, which are not generic, but must be considered, 
especially for man-made objects. For nodes ($A_1A_3$ and $A_1^4$), the classification 
is based on the number of incoming branches, Figure 2b. Table 1 summarizes 
the notation. This classification of the medial axis and shock points leads to an 
intriguing data structure for representing them. 



Definition 1. Let the $A_1^4$ and $A_1A_3$ points of a 3D MA denote nodes, and 
the $A_1^3$ and $A_3$ curves which connect these nodes denote links, which together 
with the radius function attribute form the shock scaffold graph. In addition, 
let the $A_1^2$ surface patches, whose boundary is described by an ordered, closed 
sequence of nodes and links, act as hyperlinks on the graph to form the shock 
hypergraph. 
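
One possible way to hold this structure in memory is sketched below in Python. It is an illustrative data layout rather than the authors' implementation: the class and method names are hypothetical, the type labels are written as plain strings, and a hyperlink stores its boundary only as an ordered list of link indices (the nodes follow from the links).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NODE_KINDS = {"A1^4", "A1A3"}   # isolated points of the MA
LINK_KINDS = {"A1^3", "A3"}     # curves connecting the nodes

@dataclass
class Node:
    kind: str
    position: Tuple[float, float, float]
    radius: float                # radius function attribute at the node

@dataclass
class Link:
    kind: str
    start: int                   # index of the start node
    end: int                     # index of the end node

@dataclass
class Hyperlink:
    boundary: List[int]          # ordered, closed sequence of link indices

@dataclass
class ShockHypergraph:
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)
    hyperlinks: List[Hyperlink] = field(default_factory=list)

    def add_node(self, kind, position, radius) -> int:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node type: {kind}")
        self.nodes.append(Node(kind, position, radius))
        return len(self.nodes) - 1

    def add_link(self, kind, start, end) -> int:
        if kind not in LINK_KINDS:
            raise ValueError(f"unknown link type: {kind}")
        if not (0 <= start < len(self.nodes) and 0 <= end < len(self.nodes)):
            raise ValueError("link endpoints must be existing nodes")
        self.links.append(Link(kind, start, end))
        return len(self.links) - 1

    def add_hyperlink(self, boundary) -> int:
        if any(not 0 <= i < len(self.links) for i in boundary):
            raise ValueError("hyperlink boundary must use existing links")
        self.hyperlinks.append(Hyperlink(list(boundary)))
        return len(self.hyperlinks) - 1

    def scaffold(self):
        """The shock scaffold: nodes and links only, without the sheets."""
        return self.nodes, self.links

# Toy example with made-up coordinates: one sheet bounded by two curves.
graph = ShockHypergraph()
a = graph.add_node("A1A3", (0.0, 0.0, 0.0), 0.0)
b = graph.add_node("A1^4", (1.0, 0.0, 0.0), 0.5)
ridge = graph.add_link("A3", a, b)
axis = graph.add_link("A1^3", b, a)
sheet = graph.add_hyperlink([ridge, axis])
```

Dropping the `hyperlinks` list leaves exactly the scaffold graph, which mirrors how the definition layers the hypergraph on top of the scaffold.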



We illustrate the scaffold concept in Figure 3. Clearly, the scaffold represen- 
tation arises from a recognition of special types of points, i.e., an understanding 
of the local topology of each type of points as isolated point, curve, or sheet, and 
the connectivity among the five types. When this graph structure is ignored, 
a trace of MA points remains which is the classical view of the MA found in 
the literature. The advantage of the graph structure is that it organizes the MA 
information into groups and specifies their connectivity. It is precisely the con- 
nectivity among groups which contains the qualitative information, while the 
remaining information allows for an exact reconstruction or approximation of 
the shape from the shock hypergraph [7]. From the shock scaffold alone, we are 
still able to get a fairly good idea of the shape of the object due to the remain- 
ing connectivity, in the same way that a generalized axis (curve) represents a 
Figure 3. The shock scaffold is illustrated for a few simple shapes. The dark broken 
lines are surface ridges ($A_3$), the smaller dots are surface vertices ($A_1A_3$), the 
larger nodes are $A_1^4$ shocks, the interior links have arrows to indicate flow (all 
$A_1^3$'s here), the hashed sheets are hyperlinks ($A_1^2$, not all shown). The shock 
scaffold of a tetrahedron consists of 5 nodes, 10 links and 6 hyperlinks; for a 
truncated tetrahedron we have 8 nodes, 7 links and 9 hyperlinks. (c) Sketch of the 
shock scaffold for a branching structure which at the top is a cylinder whose base 
grows from a triangle to an ellipse, and which splits into two cylindrical structures 
with elliptic bases. 

cylinder well. The MA can be approximated by interpolating the missing MA 
sheets, stretching smooth elastic surfaces over the bounding curves, much as is 
done when a “tent” is constructed. Similar arguments hold for the geometry of 
the curves such that at the very coarsest level only nodes need be retained. 

3 3D MA recovery by flow along scaffold curves

A key idea of this paper is that the shock scaffold is not merely a post-processing 
tool for organizing traced MA geometry. Rather, it is an essential element in the 
recovery process itself. The argument is based on substantial savings in 2D if 
a flow-based recovery of 2D shocks from boundary sources is adopted [25,27], 
where the flow permits the consideration of only the relevant bisectors. 

Consider the problem of deriving the MA of $M$ surface patches $G_i$,
$i = 1, \ldots, M$. In computational geometry, the pairwise bisector $B_{ij}$,
i.e., the equidistant surface between a pair of models $G_i$ and $G_j$, is
computed for all such pairs and the results “trimmed” by removing those portions
of $B_{ij}$ which are closer to a third source $G_k$ (Figure 4a). This results in
the exact MA, but is computationally prohibitive. However, note that shocks can
be considered as flowing along bisectors. The shock flow begins at certain
initial sources and terminates at sinks. If the initial sources are identified
and traced, only the viable bisectors are considered, thus tremendously reducing
the computational effort (Figure 4b). While in 2D the flow along shocks is a 1D
process, in 3D, shocks flow along a vector field on a sheet. A key insight, which
reduces the underlying dimension of the computational effort, is that flow along
shock curves ($A_1^3$ and $A_3$) is sufficient to recover the MA exactly. This is
because all MA sheets are bounded by curves, and it is clear from each curve
which two sources generate each sheet bisector,
Figure 4. In computational geometry, pairs of sources, represented as dark segments
in (a), are used to compute the set of bisectors, shown in (a) as the remaining
curves. (b) These bisectors are then “trimmed” to obtain the MA [27]. In our
approach, we use a notion of propagation along bisectors which are initiated from
valid sources (double arrows), thus avoiding the need to consider the numerous
irrelevant bisectors. (c) The situation is similar in 3D: three point sources
$G_1$, $G_2$ and $G_3$ give rise to 3 bisector sheets and 1 trisector curve. The
computation of only valid sources and flows along curves leads to tremendous
improvements in efficiency.



thus leading to an exact identification of sheets (Figure 4c), if the shock curves
are available.

Initial shock sources for curves are identified as follows. Along $A_1^3$ and
$A_3$ curves, $A_1^3$-2 and $A_3$-2 are the only initial sources, respectively.
Thus, it is sufficient to identify $A_1^3$-2 and $A_3$-2 MA points and propagate
along $A_1^3$ and $A_3$ curves from these points until the curves come to an end,
which can only happen at $A_1A_3$ and $A_3$-4 points for $A_3$ curves, and at
either $A_1A_3$, $A_1^3$-4 or $A_1^4$ points for $A_1^3$ curves. The recipe for
continuing the propagation at junctions is also straightforward: shock curves
entering an $A_1^4$ can either terminate ($A_1^4$-4), or leave in a single
($A_1^4$-3) or two ($A_1^4$-2) outcoming branches; similarly, $A_1A_3$ shocks can
be $A_1A_3$-3 (with $A_3$ flowing in and $A_1^3$ flowing out), $A_1A_3$-2, or
$A_1A_3$-4. The general algorithm for $M$ sources is thus described at an abstract
level as summarized below.

1. Identify all MA initial shock sources. 

2. Identify the next junction by considering intersections of the $A_3$ and
$A_1^3$ curves with neighboring shock waves, and propagate these for each source.

3. If an outcoming shock is available, use the junction as a source and go to 
step 2 (iteration). 

4. Output the shock scaffold graph. 

The details of this algorithm for extracting the shock scaffold of unorganized 
clouds of points can be found in [15]. 
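
The four steps above can be read as a worklist traversal. The sketch below shows
only that control flow, under the assumption that the geometric work is
delegated to two caller-supplied functions; `next_junctions` and `has_outgoing`
are hypothetical placeholders and not part of the paper, whose actual procedure
is given in [15].

```python
from collections import deque


def trace_scaffold(initial_sources, next_junctions, has_outgoing):
    """Worklist sketch of steps 1-4 (control flow only).

    initial_sources: iterable of hashable source ids (step 1, assumed given).
    next_junctions(src): hypothetical callback returning the junction ids
        reached by the shock curves leaving src (empty if none, step 2).
    has_outgoing(junction): hypothetical callback, True if at least one shock
        curve leaves the junction (step 3).
    Returns (nodes, links) as sets, standing in for the scaffold graph (step 4).
    """
    sources = list(initial_sources)
    nodes, links = set(sources), set()
    queue = deque(sources)
    expanded = set()
    while queue:
        src = queue.popleft()
        if src in expanded:
            continue                      # each source is propagated only once
        expanded.add(src)
        for junction in next_junctions(src):
            links.add((src, junction))
            nodes.add(junction)
            if has_outgoing(junction) and junction not in expanded:
                queue.append(junction)    # the junction becomes a new source
    return nodes, links
```

With a toy flow table such as `{"s1": ["j1"], "s2": ["j1"], "j1": ["j2"]}` the
traversal visits `j1` once even though two curves reach it, which is the
iteration described in step 3.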

4 Unorganized clouds of points 

In this paper we only consider shapes specified by a collection of unorganized
point sources, e.g., as they arise from a laser scanner. The bisector for a pair
of point sources, $G_i = (x_i, y_i, z_i)$ and $G_j = (x_j, y_j, z_j)$, is a planar
sheet orthogonal to the line joining the sources and passing through their
midpoint, $M_{ij} = \frac{1}{2}(x_i + x_j, y_i + y_j, z_i + z_j)$. The point
$M_{ij}$ is the initial shock source $A_1^2$-2 on the sheet $B_{ij}$ (Figure 4c),
which is described by the implicit polynomial:

$$2\,[(x_i - x_j)x + (y_i - y_j)y + (z_i - z_j)z] + [(x_j^2 + y_j^2 + z_j^2) - (x_i^2 + y_i^2 + z_i^2)] = 0.$$
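
The implicit form of the bisector of two point sources can be checked
numerically. The following is a minimal sketch in plain Python; the function
names `bisector_plane` and `plane_value` are illustrative and do not come from
the paper. The left-hand side of the plane equation equals the difference of
squared distances to the second and first source, so it vanishes exactly on the
bisector and is positive on the side of the first source.

```python
def bisector_plane(gi, gj):
    """Return (normal, offset, midpoint) such that normal . p + offset = 0 on the bisector."""
    normal = tuple(2.0 * (a - b) for a, b in zip(gi, gj))
    offset = sum(b * b for b in gj) - sum(a * a for a in gi)
    midpoint = tuple(0.5 * (a + b) for a, b in zip(gi, gj))
    return normal, offset, midpoint


def plane_value(normal, offset, p):
    """Evaluate the implicit polynomial at point p (zero on the bisector)."""
    return sum(n * c for n, c in zip(normal, p)) + offset


# Two sources on the x-axis: the bisector is the plane x = 1.
normal, offset, mid = bisector_plane((0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
on_plane = plane_value(normal, offset, (1.0, 5.0, -3.0))
at_midpoint = plane_value(normal, offset, mid)
```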

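Similarly, for three point sources, $G_i$, $G_j$ and $G_k$, the $A_1^3$ shock
curve is computed as the intersection of two shock sheets. The initial point of
flow along this curve, the $A_1^3$-2 point, is the circumcenter, $O_3$, of the
triangle defined by the three sources, which can be efficiently and robustly
computed in terms of the edge lengths of the triangle as

$$O_3 = G_i + \frac{b^2\,[(\vec{a} \wedge \vec{b}) \wedge \vec{a}] + a^2\,[\vec{b} \wedge (\vec{a} \wedge \vec{b})]}{8A^2},$$

where $\vec{a} = \overrightarrow{G_iG_j}$, $\vec{b} = \overrightarrow{G_iG_k}$,
$a = \|\vec{a}\|$, $b = \|\vec{b}\|$, $\wedge$ denotes the vector product, and $A$
denotes the area of the triangle. Next, for four point sources, $G_i$, $G_j$,
$G_k$ and $G_l$, the $A_1^4$ node is computed as the circumcenter, $O_4$, of the
tetrahedron defined by the four sources as

$$O_4 = G_i + \frac{c^2\,(\vec{a} \wedge \vec{b}) + b^2\,(\vec{c} \wedge \vec{a}) + a^2\,(\vec{b} \wedge \vec{c})}{12V},$$

where $\vec{c} = \overrightarrow{G_iG_l}$, $c = \|\vec{c}\|$, $V$ denotes the
(signed) volume of the tetrahedron, $V = \frac{1}{6}\,\vec{a} \cdot (\vec{b} \wedge \vec{c})$,
and $\vec{a}$, $\vec{b}$, $a$ and $b$ are defined as for the triangle case.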

Finally, no $A_3$ curves or $A_1A_3$ points are possible for point sources. A
detailed computational analysis, including numerical complexity and pseudocode
listings, is provided in [15]. We note that the computations for unorganized
polygonal patches are much more intricate but nevertheless fully computable [16].




Figure 5. The shock scaffold is depicted for collections of (white) dot samplings, (a) 
Regular and (b) irregular samplings of a spherical cap. (c) Regular sampling on a 
cylindrical segment. 
5 Results and Discussion

We now present examples for dot samples of geometrically simple shapes, i.e.,
spherical caps, cylinders and a parallelepiped, as well as for dot samplings of
the surface of an aorta section, a pot sherd and a full pot. An intrinsic
challenge in presenting 3D MA results is their visualization: they are best seen
interactively in 3D, and we invite the reader to visit
www.lems.brown.edu/vision/researchAreas/Shocks3D/, but they must be conveyed
with 2D snapshots here.




Figure 6. (a) Input samples randomly distributed along half-cylindrical sections
(shown as white spheres). Darker spheres indicate $A_1^3$-2 and $A_1^4$ points.
(b) Pruning: cutting away initial curve shock sources $A_1^3$-2 and associated
branches leads to the central axis.

The first set of examples illustrates the effect of grid sampling along spherical
caps and cylindrical segments (Figure 5). Note the correct placement of $A_1^3$-2
points (grey) which identify initial shock curve sources. We only keep and show
those curve segments which are connected to valid $A_1^4$ nodes. These curves also
propagate to infinity (not shown), but this is easily and explicitly detected via
the circumcenters of associated shock nodes. In Figure 5a the regular sampling
of a spherical cap results in a single $A_1^4$ at the sphere center, as expected.
In Figure 5b perturbations along the tangent space keep this geometry intact. The
shock scaffold for a cylindrical segment is shown in Figure 5c, where we note that
the “generalized axis” is readily visible. In Figure 6, randomly distributed points
(white spheres) along sections of a half-cylinder are used as input. Application
of a structural pruning strategy [26,17] leads us to directly retrieve the main
axis of the original cylinder. Note that a significant goal in object
representation is the approximation of data by generalized cylinder descriptions
which are highly symmetric.

While the first set of examples examines the correctness of the algorithm for 
simple situations, the next example examines the more complex geometry of a 
parallelepiped (Figure 7). Observe that $A_1^4$ points are placed as expected. This
Figure 7. Left to right: Initial sample points on the surface of a parallelepiped; full
scaffold; scaffold with initial shock curves pruned away. 



parallelepiped has a large number of degeneracies, i.e., overlaps of $A_1^3$-2 and
$A_1^4$ shock points. A large portion of the scaffold is due to initial shock curves
(Figure 7, middle) . Removing the initial curve shocks and their associated links 
reveals the internal structure of the scaffold (Figure 7, right). Shock links with 
infinite flow velocity are also identified. Furthermore, the direction of flow of 
regular links is also available explicitly (not illustrated here). 




Figure 8. Left: bottom part of an aorta scan (data from [29]). Middle: scaffold without 
the initial curve shocks. Right: simplified scaffold after using the combined structural- 
saliency pruning. 

In Figure 8 we show the result of applying the combined structural-saliency
pruning [17] on the bottom branching part of the aorta data. Note that here we
used a maximum distance criterion to suppress the medial structure in between
the two “legs”, therefore displaying the scaffold structure only in the vicinity
of the input data.

In Figure 9, we show the effect of this pruning on a close-up of the top-section 
for the aorta data of Figure 8 (transversal section with respect to the aorta’s 
main vertical orientation, seen from the side). Note that each loop (in Figures 8
and 9) in the scaffold corresponds to a shock sheet (hyperlink). 




Figure 10. Left: original pot sherd, 5000 samples, obtained from a laser scanner. The 
frontal (middle) and side (right) views of the shock scaffold nodes. 

Our last examples are shown for samples taken from pottery excavated at the
archaeological site of Petra, Jordan. The samples are obtained via laser scanning.
The pruned shock scaffold (nodes only) of a pot sherd is depicted in frontal
(middle) and side (right) views in Figure 10. Results for a full pot are
illustrated in Figure 11, where we overlap input samples and scaffold nodes; this
pot has a complex symmetry structure due to a neck, two handles, a hole at the
bottom and input samples on part of the internal surfaces. Our use of skeletons in
this project aims to single out the curves along the shock scaffold as a
representation of the pot sherd. The latter can then be matched to other sherds,
i.e., used in the stitching of sherds to ultimately achieve automatic
reconstruction of the full pot [13].

In conclusion, we have presented an approach to the recovery and representation
of the 3D Medial Axis based on the notion of a hierarchically organized shock
scaffold and presented a specific method for extracting the shock scaffold of a
cloud of points. The approach combines computational geometry and
propagation-based methods, is exact, efficient, applicable to unorganized points
[15] and surface patches [16], and results in a graph structure which can be used
in recognition and matching applications [17].

F.F. Leymarie and B.B. Kimia

Figure 11. Point samples (51 000) of a full pot (a) give rise to the shock scaffold nodes
in (b). Source samples are also shown.
Abstract. We present an algorithm that, starting from the surface
skeleton of a 3D solid object, computes the curve skeleton. The algorithm
is based on the detection of curves and junctions in the surface
skeleton. It can be applied to any surface skeleton, including the case in
which the surface skeleton is two-voxel thick.



1 Introduction 

Reducing discrete structures to lower dimensions is desirable when dealing with 
volume images. This can be done by skeletonization. The result of skeletonization 
of a 3D object is either a set of surfaces and curves, or, if even more compression is 
desired and the starting object is a solid object, a set of only curves. In the latter 
case, the curve skeleton of the object is obtained. The curve skeleton is a 1D set
centred within the object and with the same topological properties. Although
the original object cannot be recovered starting from its curve skeleton, the
curve skeleton is useful to achieve a qualitative shape representation of the
object with reduced dimensionality.

There are two different approaches to compute the curve skeleton of a 3D
object. One approach is to directly reduce the 3D object to its curve skeleton.
See for example [1,5,7]. Another approach is to first obtain a surface skeleton
from the 3D object. Thereafter, the curve skeleton can be computed from the
surface skeleton. See for example [?]. For both approaches, maintenance of
the topology is not too hard to fulfil as topology preserving removal operations
are available. The more crucial point is the detection of the end-points, i.e., the
voxels delimiting peripheral branches in the curve skeleton. These voxels could
in fact be removed without altering topology, but their removal would cause
unwanted shortening and thereby important shape information would be lost.
Different end-point detection criteria can be used. Criteria based on the number
(and possibly position) of neighbouring voxels are blind in the sense that it is not
known a priori which end-points (and, hence, which branches) the curve skeleton
will have. In fact, local configurations of object voxels, initially identical to each
other, may evolve differently, due to the order in which voxels are checked for
removal. Thus, the end-point detection criterion is sometimes fulfilled and
sometimes not for identical configurations. Preferable end-point detection
criteria are based on geometrical properties, i.e., end-points are detected in
correspondence with convexities on the border of the object, or with centres of
maximal balls. As far as we know, only blind criteria were used when the first
approach (object → curve skeleton) was followed. Therefore, we regard the second
approach (object → surface skeleton → curve skeleton) as preferable, especially
when a distance transform based algorithm is used in the object → surface
skeleton phase. In fact, in this case the surface skeleton can include all the
centres of maximal balls [?]. This guarantees that end-points delimiting curves
in the surface skeleton are automatically kept. The problem of not removing those
end-points and correctly identifying other end-points during the surface
skeleton → curve skeleton phase still needs to be carefully handled.

We present an algorithm to compute the curve skeleton from the surface 
skeleton and use geometrical information to ascribe to the curve skeleton voxels 
placed in curves and in (some of the) junctions, including voxels that will play 
the role of end-points in the curve skeleton. 

A surface skeleton consists, in most cases, of surfaces and curves crossing
each other. The basic idea behind our algorithm is to detect the curves and
the junctions between different surfaces and prevent their removal. This
automatically prevents unwanted shortening of curves and junctions without need
of any end-point detection criterion. Indeed, we start from an initial classification
of voxels in the surface skeleton [8]. We distinguish junction, inner, edge, and
curve voxels, see Fig. 1. The border of the surface skeleton is the set including
both edge and curve voxels. All curve voxels should be ascribed to the curve
skeleton. Junctions shown in Fig. 1 are placed in the innermost regions of the
surface and should also be ascribed to the curve skeleton. However, junctions
are not always in the innermost part of the surface. In fact, junctions may group
in such a way that they delimit a surface in the surface skeleton. See Fig. 2,
where all junctions in the surface skeleton are shown to the right. Only junctions
that could be interpreted as peripheral branches in the set of junctions should
be kept in the skeleton, while junctions grouped into loops should not, as this
would prevent the curve skeleton from being obtained. (Note also that loops in the
curve skeleton correspond to tunnels in the object, and no tunnels exist in the
surface skeleton shown in Fig. 2.)

Our algorithm computes the curve skeleton from the surface skeleton in two 
steps, both based on iterated edge voxel removal. During the first step, all curve 
and junction voxels found in the original surface skeleton are always prevented 
from being removed. During the second step, voxels initially classified as junction 
voxels are prevented from removal, only if they are now classified as curve voxels; 
all voxels detected as edge voxels during this step are possibly removed. (Note 
Fig. 1. A surface, top, with its classification, bottom.




Fig. 2. A surface skeleton, left, and its junction voxels, right. 



that also voxels that were classified as junction voxels during the classification 
done on the original surface skeleton are now possibly removed.) 

The algorithm outlined above can be applied after any surface skeletonization 
algorithm and results in a curve skeleton. Moreover, the classification that we 
use can deal also with two-voxel thick surfaces so that our curve skeletonization 
algorithm can also be applied after algorithms resulting in two- voxel thick surface 
skeletons. 

2 Notions 

We refer to bi-level images consisting of object and background. In particular, in
this paper the object is a surface, e.g., the one resulting after a 3D solid object 
has been reduced to its surface skeleton. The 26-connectedness is chosen for the 
object and the 6-connectedness for the background. Any voxel v has three types 
of neighbours: face, edge, and point neighbours. 
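
The three neighbour types can be enumerated directly from the offsets of the
3 × 3 × 3 neighbourhood. The sketch below is illustrative only (the function
name `neighbour_offsets` is not from the paper): an offset with one non-zero
coordinate shares a face with the central voxel, two non-zero coordinates share
an edge, and three share only a point.

```python
from itertools import product


def neighbour_offsets():
    """Group the 26 neighbour offsets of a voxel into face, edge and point neighbours."""
    groups = {"face": [], "edge": [], "point": []}
    for d in product((-1, 0, 1), repeat=3):
        nonzero = sum(1 for c in d if c != 0)
        if nonzero == 1:
            groups["face"].append(d)     # the 6 face neighbours
        elif nonzero == 2:
            groups["edge"].append(d)     # the 12 edge neighbours
        elif nonzero == 3:
            groups["point"].append(d)    # the 8 point neighbours
        # nonzero == 0 is the voxel itself and is skipped
    return groups


groups = neighbour_offsets()
counts = {kind: len(offsets) for kind, offsets in groups.items()}
```

Here 6-connectedness uses only the face offsets, while 26-connectedness uses
all three groups together.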
We will use two different surface skeletonization algorithms for the examples
in this paper. One is based on the $D^6$ metric, i.e., the 3D equivalent of the
city-block metric, and was introduced in [?]. We will call the resulting set
$D^6$ surface skeleton. The other algorithm is based on the $D^{26}$ metric, i.e.,
the 3D equivalent of the chess board metric, and was introduced in [?]. We will
call the resulting set $D^{26}$ surface skeleton.
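
For reference, the two metrics can be written down in a couple of lines. This is
a small sketch assuming the usual definitions of the 3D city-block and chess
board distances between voxel coordinates; the function names `d6` and `d26`
follow the conventional names of these metrics and are introduced here only for
illustration.

```python
def d6(p, q):
    """3D city-block distance: number of face steps between voxels p and q."""
    return sum(abs(a - b) for a, b in zip(p, q))


def d26(p, q):
    """3D chess board distance: number of face, edge or point steps between p and q."""
    return max(abs(a - b) for a, b in zip(p, q))


origin, voxel = (0, 0, 0), (1, 2, 3)
city_block = d6(origin, voxel)      # 1 + 2 + 3 = 6
chess_board = d26(origin, voxel)    # max(1, 2, 3) = 3
```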

Classification of the voxels in a surface was suggested in [?]. A voxel is
classified after investigating its 3 × 3 × 3 neighbourhood. Of course, that
classification works for an “ideal” surface, i.e., a surface which is one-voxel
thick everywhere. Complex cases consisting of surfaces crossing each other would
not produce a consistent classification at junctions, whenever these are more than
one-voxel thick, see Fig. 3. In Fig. 3 left, voxels where the two surfaces cross
each other, shown in dark grey, are classified as junction voxels, while in
Fig. 3 right, voxels where the two surfaces cross, marked by •, are classified as
inner voxels. That classification also fails when applied to surface skeletons of
3D objects having regions whose thickness is an even number of voxels. These
surface skeletons are in fact likely to be two-voxel thick [?].




Fig. 3. Simple examples of junctions between surfaces. Edge voxels are shown in white,
inner voxels in grey, and junction voxels in dark grey for the classification introduced
in [?]. Voxels marked by • should be classified as junction voxels to be consistent.



In this paper, we use the classification introduced and thoroughly described
in [8]. There, some criteria suggested in [?] are used in combination with other
criteria, where a slightly larger neighbourhood of each voxel is taken into account.
Two-voxel thick regions are singled out with a linear four-voxel configuration
(4 × 1 × 1, 1 × 4 × 1, 1 × 1 × 4), which identifies portions of the surface
skeleton being exactly two-voxel thick in any of the x-, y-, z-directions. The
classification requires a number of different criteria and the same voxels are
likely to be checked against many criteria before they are eventually classified.
It is therefore not possible to summarize it here. A more detailed description is
given in a recently submitted paper [?].
3 Curve Skeletonization by Junction Detection

The different classes of voxels in the surface skeleton we are interested in are
junction, inner, edge, and curve voxels, see Fig. 1. The curve skeleton is obtained
in two steps by an iterative algorithm. Each iteration of both steps includes two
subiterations dealing with i) detection of edge voxels and ii) voxel removal by
means of topology preserving removal operations, respectively. The two steps
differ from each other in the selection of the voxels that, at each iteration, are
checked to identify the edge voxels, i.e., the set of candidate voxels for removal.

In the first step, only voxels initially (i.e., on the original surface skeleton) 
classified as inner (or edge) voxels are checked during the identification of the 
edge voxels. Voxels are actually interpreted as edge voxels, if their neighbourhood 
has been suitably modified due to removal of some neighbouring voxels. Note that 
we do not need any end-point detection criterion because (curve and) junction 
voxels are never checked to establish whether they have become edge voxels, 
iteration after iteration. An undesirable branch shortening would be obtained if 
also the voxels initially classified as junction voxels were checked. In fact, voxels 
placed on the tips of junctions, i.e., the junction voxels that should play the role 
of end-points, could be classified as edge voxels and, as such, could be removed. 

In the second step, also voxels initially classified as junction voxels are checked 
during the identification of the edge voxels. Edge voxels that have been trans- 
formed into curves during the first step are not interpreted as edge voxels and, 
hence, are automatically preserved from removal. The remaining voxels initially 
classified as junction voxels can be now interpreted as edge voxels, if their neigh- 
bourhood has been suitably modified. 

In both steps, on the current set of edge voxels, removal is done unless voxels 
are necessary for topology preservation. Standard topology preserving removal 
operations, e.g., those described in [?], are sequentially applied. After each
subiteration of removal of edge voxels, a new iteration starts and a new set of 
edge voxels is determined. Removal operations are then applied on the new set 
of edge voxels. Edge detection and voxel removal are iterated until no more edge 
voxels are identified and possibly removed. 
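
The control flow of the two steps can be sketched as follows. This is only a
skeleton of the iteration under stated assumptions: the voxel classification and
the topology test are not reproduced here, so `is_edge` and `is_removable` are
hypothetical caller-supplied callbacks, and `initial_class` is assumed to hold
the classification computed once on the original surface skeleton.

```python
def curve_skeleton(voxels, initial_class, is_edge, is_removable):
    """Two-step iterated edge-voxel removal (control flow only).

    voxels: set of (x, y, z) tuples forming the surface skeleton.
    initial_class: dict voxel -> 'junction' | 'inner' | 'edge' | 'curve'.
    is_edge(v, current): hypothetical classifier, True if v is an edge voxel
        of the current set (curve voxels must return False).
    is_removable(v, current): hypothetical topology-preservation test.
    """
    current = set(voxels)

    def run(checked_classes):
        while True:
            # Subiteration i): detect edge voxels among the voxels that may be checked.
            edges = [v for v in sorted(current)
                     if initial_class[v] in checked_classes and is_edge(v, current)]
            removed = False
            # Subiteration ii): sequential removal, keeping voxels needed for topology.
            for v in edges:
                if is_removable(v, current):
                    current.discard(v)
                    removed = True
            if not removed:
                return

    run({"inner", "edge"})               # step 1: junction and curve voxels are never checked
    run({"inner", "edge", "junction"})   # step 2: initial junction voxels are checked as well
    return current
```

Keeping the initial classification fixed while `is_edge` looks at the current
set mirrors the distinction made in the text between how a voxel was classified
on the original surface skeleton and how it is interpreted at a later iteration.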

If the set of junctions of the surface skeleton has only peripheral junc- 
tions, i.e., no junctions are grouped into loops, the curve skeleton is obtained 
directly after the first step. It consists of the initial curve and junction voxels, as 
well as voxels necessary for connectedness. Otherwise, also the second step is nec- 
essary. In this case, the effect of the first step is to cause some junctions initially 
forming loops (see Fig. 2) to become edge voxels. This allows skeletonization to
continue towards voxels in the innermost part of the surface skeleton. 

Our algorithm is first illustrated on the $D^6$ surface skeleton of a simple object,
a cube, for which the first step is enough to compute the curve skeleton (Fig. 4).
The $D^6$ surface skeleton of the cube is shown in the middle. The resulting curve
skeleton, shown to the right, coincides with the set of junction voxels.

A slightly more complex case, the $D^6$ surface skeleton of a box, is shown
in Fig. 5(b). The set resulting at completion of the first step of our algorithm
is shown in Fig. 5(c). Voxels detected as junction voxels during the initial
classification are partly transformed into curve voxels, and partly into edge
voxels surrounding the rectangular surface found in the middle of the box. The
curve skeleton is the set resulting after the second step of our algorithm, see
Fig. 5(d).

I. Nyström, G. Sanniti di Baja, and S. Svensson

Fig. 4. A cube with its $D^6$ surface skeleton and the curve skeleton computed by
our algorithm.

Fig. 5. A box with its $D^6$ surface skeleton, top. The intermediate result and the
final result of the curve skeletonization algorithm, bottom.

The box in Fig. 5(a) is of size 60 × 40 × 20 voxels, i.e., it has an even number
of voxels in every direction. The rectangular surface in the middle of the surface
skeleton is hence two-voxel thick. Therefore, also the obtained curve skeleton is
two-voxel thick in the central part. Final reduction to a one-voxel thick curve
skeleton could be achieved by identifying tips of protrusions and iteratively
removing voxels not necessary for topology preservation, as was shown in [2].

For the sake of completeness, we point out that the surface skeleton could 
have been reduced to one-voxel thickness before extracting the curve skeleton. 
One reason why we prefer not to do so, is that the resulting curve skeleton could 
be more than one-voxel thick anyway (the rectangular surface can be one-voxel
thick in depth, but an even number of voxels in other directions, as in the case
above). Also, we have found that if reduction to one-voxel thickness is postponed
until the curve skeleton has been obtained, the risk of creating spurious branches 
in the curve skeleton is significantly reduced. 



4 Some Examples 

Projections of thin complex structures are hard to visualize in a descriptive way.
We therefore show the results of our algorithm on rather small synthetic objects.
In Figs. 6 and 7, a pyramid rotated 45° with its $D^6$ and $D^{26}$ surface
skeletons and the curve skeletons computed by our algorithm are shown. In Figs. 8
and 9, a cylinder with its $D^6$ and $D^{26}$ surface skeletons and the curve
skeletons computed by our algorithm are shown.




Fig. 6. A pyramid rotated 45° with its $D^6$ surface skeleton and the curve skeleton
computed by our algorithm.




Fig. 7. The same pyramid as in Fig. 6 with its $D^{26}$ surface skeleton and the
curve skeleton computed by our algorithm.

Fig. 8. A cylinder with its $D^6$ surface skeleton and the curve skeleton computed by
our algorithm.





Fig. 9. The same cylinder as in Fig. 8 with its $D^{26}$ surface skeleton and the
curve skeleton computed by our algorithm.



We remark that the curve skeletonization algorithm can be applied regardless 
of which algorithm has been used to compute the surface skeleton. In any case, 
the curve skeleton is a satisfactory shape descriptor. 

The curve skeleton of the $D^{26}$ surface skeleton of the cylinder, Fig. 9 right, has
a number of peripheral branches besides those including voxels initially classified 
as junction voxels. This is due to the fact that the edges of the surfaces are 
characterized by convexities (angle less than 90°), which during iterated voxel 
removal are shrunk to curves. These curves are very short as they consist of one 
or two voxels only. However, once voxels have been classified as curve voxels they 
are ascribed to the skeleton and this causes further voxels to be prevented from 
removal during curve skeletonization. 



5 Conclusion 

In this paper, we have presented an algorithm that computes the curve skeleton 
of a 3D solid object starting from its surface skeleton. The algorithm is based 
on the detection of curve and junction voxels in the surface skeleton. One of 



Curve Skeletonization by Junction Detection 






the advantages of this approach is its independence of the choice of surface
skeletonization algorithm. This is not always the case with other algorithms. For
example, the surface skeleton → curve skeleton part of the algorithm presented
in [?] can only be computed when starting from a $D^6$ surface skeleton, to obtain
a reasonable result. In fact, it includes a blind end-point detection criterion
tailored specifically to the $D^6$ case, which would not work nicely in other
cases, e.g., for a $D^{26}$ surface skeleton.

The computational cost of the curve skeletonization algorithm is quite high 
(a couple of minutes for complex real objects in images of size 128 x 128 x 128 
voxels). This is due to the non-optimized classification process that has to be 
repeatedly used during curve skeletonization. 

We have tested our algorithm on a large number of surface skeletons, one- 
voxel and two-voxel thick, and have in all cases obtained satisfactory results. 

Acknowledgement. We are thankful to Prof. Gunilla Borgefors, Centre for 
Image Analysis, Uppsala, Sweden, for useful discussions on skeletonization meth- 
ods. 
Abstract. Exact mathematical representations of objects are not suitable for
applications where object descriptions are vague or object data is imprecise or 
inadequate. This paper presents representation schemes for basic inexact 
geometric entities and their relationships based on fuzzy logic. The aim is to 
provide a foundation framework for the development of fuzzy geometric 
modelling which will be useful for both creative design and computer vision 
applications. 



1 Introduction 

The success of object recognition depends very much on how an object is represented 
and processed. In many cases, an exact low-level geometric representation for the 
object such as edges and vertices, or control vertices for parametric surfaces, or CSG 
(Constructive Solid Geometry) primitives might not be possible to obtain. There 
might not be sufficient information about the objects because they were not 
previously known, or because the image data is too noisy. A representation scheme 
for fuzzy shapes, if one exists, would be able to reflect more faithfully the characteristics
of data and provide more accurate object recognition. Similarly, commercial CAD 
packages require designers to specify object shapes in exact low-level geometric 
representations. The current practice is for designers to manually sketch many 
alternative designs before a final design is chosen and a detailed model is constructed 
from it, using a CAD package. The necessity to work with exact object 
representations is counter-intuitive to the way designers work. What is needed is an
intuitive and flexible way to provide designers with an initial rough model by 
specifying some fuzzy criteria which are more in tune with the fuzziness in human 
thought process and subjective perception. This need for fuzziness also arises from 
our inability to acquire and process adequate information about a complex system. 
For example, it is difficult to extract exact relationships between what humans have in 
mind for objects’ shape and what geometric techniques can offer due to the 
complexity of rules and underlying principles, viewed from both perceptual and 
technical perspectives. In addition, designers often start a new design by modifying 
an existing one, hence it would also be advantageous to have a library of fuzzy 
objects which can be specified and retrieved based on fuzzy specifications. 

Another problem that would benefit from a fuzzy representation of shapes is how 
to overcome the lack of robustness in solid modeling systems. Although objects in 
these systems are represented by ideal mathematical representations, their coordinates
are represented approximately in a computer in floating point arithmetic which only
has finite precision. The inaccuracy arising from round-off errors causes
ill-conditioned geometric problems. For example, gaps or inappropriate intersections
may occur and result in topological violations.

The fuzziness in shape may therefore be categorised into two main types: 
ambiguity or vagueness arising from the uncertainty in descriptive language; and 
imprecision or inaccuracy arising from the uncertainty in measurement or calculation. 
Although fuzzy logic has been used extensively in many areas, especially in social 
sciences and engineering (e.g. [9,12,13]), fewer attempts have been made to apply 
fuzzy logic to shape representation, modelling and recognition. Rosenfeld and Pal [8, 
13] discussed how to represent a digital image region by a fuzzy set and how to 
compute some geometric properties and measurements for a region which are 
commonly used in computer vision (e.g. connectedness, convexity, area, 
compactness). Various fuzzy membership functions and indices of fuzziness were also 
introduced to deal with uncertainty in image enhancement, edge detection, 
segmentation and shape matching (e.g. Pal and Majumder [7], Huntsberger [5]). To 
address the lack of robustness in solid modeling, a few methods have been 
presented. For example, Barker [2] defined a fuzzy discrimination function to 
classify points in space. Hu et al. [4] introduced a comprehensive scheme that uses 
interval arithmetic as a basis for an alternative geometric representation to allow some 
notion of fuzziness. 

In previous papers, we analysed the needs for fuzziness in computer-aided design 
[9,10] and presented a scheme for shape specification using fuzzy logic in order to 
realise some aesthetic intents of designers [11]. The intention is to 
bridge the gap between the impreciseness of artistic interpretation and the preciseness 
of mathematical representations of shapes. We also constructed a database of fuzzy 
shapes based on superquadric representations and investigated appropriate retrieval 
schemes for these fuzzy shapes [14]. For some applications, fuzzy systems often 
perform better than traditional systems because of their capability to deal with non- 
linearity and uncertainty. One reason is that while traditional systems make precise 
decisions at every stage, fuzzy systems retain the information about uncertainty as 
long as possible and only draw a crisp decision at the last stage. Another advantage 
is that linguistic rules, when used in fuzzy systems, would not only make tools more 
intuitive, but also provide better understanding and appreciation of the outcomes. 

This paper deals with theoretical aspects of fuzzy geometry, in particular, how to 
represent basic fuzzy geometric entities and their relationships using fuzzy logic, and 
how to perform operations on such entities. These entities cover fuzzy points, lines, 
curves, polygons, regions and their 3D counterparts. The aim is to provide a unified 
foundation framework for the development of fuzzy geometric modelling which will 
benefit both creative design and computer vision applications. 



2 Fuzzy Geometric Representations 

This section explores some fundamental concepts of fuzzy geometric entities, how 
they can be defined in terms of fuzzy sets and how they are related to exact geometry. 
Although there are a number of different ways to introduce fuzziness into exact 
geometry, the representations chosen here follow a unified approach which allows a 
fuzzy entity to be intuitively visualized, constructed and extended to other entities in 
the shape hierarchy. It is assumed that readers have some basic knowledge of fuzzy 
sets and fuzzy system design, which can be found in many textbooks (e.g. [3,6]). 



2.1 Fuzzy Points 

In classical geometry, the coordinates of a point define its position in space and are 
expressed as an ordered set of real numbers. This definition implies the uniqueness of 
the point. However, if the position of the point is uncertain (e.g. vagueness in 
position specification), or imprecise (e.g. due to round-off error in the calculation of 
its coordinates), then this definition is no longer valid and cannot be used for 
computational purposes. A fuzzy point is introduced to capture the notion of 
vagueness and impreciseness, and at the same time, to overcome the inflexibility and 
inaccuracy that arise if a crisp point is used. 

A fuzzy point is defined as a fuzzy set $P$ such that $P = \{(p, \mu_P(p))\}$, where 
$\mu_P(p)$ is the membership function, which can take values between 0 and 1. A crisp 
point is a special case where $\mu_P(p)$ has the value 1 if $p \in P$ and 0 if $p \notin P$. 
Figure 1 displays an example of the membership function of a fuzzy point with 
respect to x-coordinate. To simplify the explanation and illustration of these new 
fuzzy geometrical elements, we assume a symmetrical membership function around 
the region of plausibility. However, all these concepts are also applicable to the cases 
where membership functions are unsymmetrical. 

A fuzzy point therefore is represented as a set of points defined within the support 
of this membership function, where each of these points has a different degree of 
possibility of belonging to this set. The corresponding support is a circle for a 2D 
point and a sphere for a 3D point respectively (Figure 2). Thus, a fuzzy point may 
be viewed as an extension of an interval point defined by Hu et al. [4]. In the latter 
case, there is an equal possibility that any point within an interval be the point of 
interest while in the former case, such possibility may be different for each point 
within the support of the membership function. 



2.2 Incidence of Fuzzy Points 

Two fuzzy points P and Q are incident if there exists a point which is common to 
both of them, in other words, if there exists a point lying within the supports of 
both fuzzy membership functions. The membership value for this incident point 
is the minimum of its membership values in these two fuzzy sets. Thus, the notion of 
weak and strong incidence may be introduced based on the membership value. This 
notion could be useful for decision-making tasks in computer vision and computer-aided 
design. The incidence is symmetric but not transitive because the existence of a 
common point in fuzzy sets P and Q, and a common point in fuzzy sets Q and 
R, does not necessarily imply that there exists a common point in fuzzy sets P and R. 
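
The incidence test can be illustrated with a minimal Python sketch. It assumes a symmetric, cone-shaped membership function over a circular support; the class `FuzzyPoint` and the function `incidence_degree` are hypothetical names introduced here, not part of the scheme described above. For radially decreasing memberships the largest common membership lies on the segment joining the two centres, so only that segment is sampled.

```python
import math


class FuzzyPoint:
    """2D fuzzy point with a symmetric, cone-shaped membership (an assumption)."""

    def __init__(self, cx, cy, radius):
        self.cx, self.cy, self.radius = cx, cy, radius

    def membership(self, x, y):
        # 1 at the centre, falling linearly to 0 at the edge of the circular support.
        d = math.hypot(x - self.cx, y - self.cy)
        return max(0.0, 1.0 - d / self.radius)


def incidence_degree(p, q, steps=200):
    """Largest min-membership sampled on the segment joining the two centres.

    A result of 0.0 means that no sampled point lies in both supports.
    """
    best = 0.0
    for k in range(steps + 1):
        t = k / steps
        x = p.cx + t * (q.cx - p.cx)
        y = p.cy + t * (q.cy - p.cy)
        best = max(best, min(p.membership(x, y), q.membership(x, y)))
    return best
```

With three points of radius 2 centred at x = 0, 2 and 4 on the same axis, the first and second are incident, the second and third are incident, but the first and third are not, which illustrates the lack of transitivity. A threshold on the returned degree could separate weak from strong incidence.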

Fig. 1. The fuzzy membership function of a fuzzy point 

Fig. 2. A fuzzy point 




2.3 Fuzzy Connectedness of Two Points 

To identify points that belong to the same object (e.g. fuzzy line segment, curve, 
region or volume), it is necessary to define the concept of fuzzy connectedness. 

Given a fuzzy set of points $F$, Rosenfeld [10] defined the degree of fuzzy 
connectedness of two points $p$ and $q$ within $F$ as 

$C_F(p, q) = \max [\min \mu_F(r)]$, where the maximum is taken over all paths 
connecting these two points and the minimum is taken over all points $r$ on each path. 

Two points $p$ and $q$ are said to be fuzzily connected in $F$ if 

$C_F(p, q) \ge \min[\mu_F(p), \mu_F(q)]$. In other words, two points are connected in a 
fuzzy set of points if there exists a path between them which is composed of only 
points which also belong to this fuzzy set. This definition is also consistent with the 
concept of connectedness of two points within a crisp set whose membership values 
are all 1. 
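
As an illustration, the degree of connectedness can be computed on a pixel grid with a widest-path variant of Dijkstra's algorithm. The sketch below is only one possible realisation: it assumes that the fuzzy set is given as a 2D list of membership values and that paths move between 4-neighbours; the function names are hypothetical.

```python
import heapq


def connectedness(mu, p, q):
    """Degree of connectedness of grid cells p and q, given as (row, col).

    The strength of a path is the smallest membership found on it; the
    result is the largest strength over all 4-connected paths.
    """
    rows, cols = len(mu), len(mu[0])
    best = {p: mu[p[0]][p[1]]}
    heap = [(-best[p], p)]  # max-heap on path strength via negated keys
    while heap:
        neg, (r, c) = heapq.heappop(heap)
        strength = -neg
        if (r, c) == q:
            return strength
        if strength < best.get((r, c), 0.0):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                s = min(strength, mu[nr][nc])
                if s > best.get((nr, nc), -1.0):
                    best[(nr, nc)] = s
                    heapq.heappush(heap, (-s, (nr, nc)))
    return 0.0


def fuzzily_connected(mu, p, q):
    """True if the best path is as strong as the weaker of the two end points."""
    return connectedness(mu, p, q) >= min(mu[p[0]][p[1]], mu[q[0]][q[1]])
```

Because cells are extracted in order of decreasing path strength, the first time the target cell is popped its strength is optimal.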



2.4 Fuzzy Lines and Planes 

A fuzzy line $PQ$ which connects two fuzzy points $P$ and $Q$ with membership 
functions $\mu_P(p)$ and $\mu_Q(q)$ is defined as a fuzzy set each of whose members is a 
linear combination of a pair of points $p$ and $q$, with a membership function defined 
as $\mu_{PQ}(pq) = \min(\mu_P(p), \mu_Q(q))$. 

A fuzzy line may be visualised as a centre line with variable thickness (Figure 3). 
This thin area of space (or thin volume of space for 3D case) bounds a family of crisp 
lines which are formed by pairs of endpoints belonging to the two fuzzy sets of 
endpoints. 

A fuzzy plane which is an extension of a fuzzy line is a thin planar shell with 
variable thickness. This shell encloses a family of crisp planes which is an extension 
of the family of crisp lines representing the fuzzy line. These concepts of fuzzy lines 
and planes encapsulate exact lines and exact planes as special cases. 




Fig. 3. A fuzzy line 




2.5 Fuzzy Polygons and Polyhedra 

A fuzzy polygon is composed of fuzzy vertices and fuzzy edges. Thus, it may be 
visualised as having edges of variable thickness, as described for fuzzy lines. This 
concept is readily extended to a fuzzy polyhedron for 3D case. 



2.6 Fuzzy Regions and Volumes 

Image segmentation or volume segmentation often results in fuzzy regions or volumes 
when the discrimination between texture characteristics is low or the quality of 
images is poor. In exact geometry, a 2D region is defined by the space bounded 
by a closed boundary. A fuzzy region is defined as the space bounded by a fuzzy 
boundary, where a fuzzy boundary is a fuzzy set of connected points in space, with 
the notion of connectedness being defined as above. Thus, a fuzzy region consists of 
two types of points: inner points and boundary points. In exact geometry, a point is 
an inner point if all points in its local 4 or 8-neighbourhood belong to the region 
while for a boundary point, there exists at least one point in its local neighbourhood 
that lies outside the region. In order to determine if a point is a boundary point or an 
inner point of a fuzzy region, we therefore need to consider the characteristics of the 
local neighbourhood of a point. 

A point is an inner point of a fuzzy region if all points in its local neighbourhood 
have membership values of 1. If at least one point in its neighbourhood has a 
membership value in the open interval (0,1), then the point is a boundary point. On the other 
hand, if points in its local neighbourhood either have membership values equal to 0 or 
are boundary points, then the point is outside the region. These definitions cover the 
notion of inner, boundary and outer points of an exact region as special cases. 

For 3D volumes, pixels are replaced by voxels, and a fuzzy boundary curve is 
replaced by a fuzzy boundary surface. The local 4- and 8-neighbourhoods of a point are 
extended to local 6- and 14-neighbourhoods, and a scheme for classification of points 
can be readily defined in a similar way. 



2.7 Fuzzy Bezier and B-Splines Curves and Surfaces 

Fuzzy free-form curves and surfaces can be represented by fuzzy polynomial splines 
such as Bezier and B-splines which have fuzzy control points. Each of these control 
points may be visualized as a fuzzy set of points located within a circle (in 2D case) 
or a sphere (in 3D case). These points may have different membership values. A 
fuzzy spline curve is therefore represented by a family of curves lying within a thin 
tube and a fuzzy spline surface is represented by a family of surfaces lying within a 
thin shell. Figure 4 shows an example of a fuzzy Bezier curve. 

In solid modeling and CAD / CAM applications, fuzzy geometry serves the same 
purpose as interval geometry. Both representation schemes are useful for overcoming 
the problems of topological violation or gaps and inappropriate intersections caused 
by floating point arithmetic. However, one advantage of fuzzy geometry over interval 
geometry is that a notion of degree of possibility or plausibility that a curve or a 
surface in these families of curves and surfaces is the element of interest can be 
represented. This fact is also relevant in image processing where many tasks such as 
edge detection and segmentation often result in edges and contours whose degree of 
fuzziness depends on the variation of contrast in intensity or colour. 



2.8 Minimum Bounding Rectangle for a Fuzzy Shape 

The minimum bounding rectangle for a shape has been used as an approximation to 
the shape to speed up many tasks such as searching for an area of interest and 
checking the intersection or occlusion of objects. The true minimum bounding 
rectangle is defined as the smallest rectangle with its sides being parallel to the major 
axes of the shape. However, for many applications, it is more efficient and sufficient 
to use the minimum bounding rectangle whose sides are parallel to the coordinate 
axes. In the exact case, this rectangle is usually determined by the minimum and 
maximum coordinates of points belonging to the shape. For a fuzzy shape, we 
obtain a thin rectangular tube whose coordinates correspond to the extreme 
coordinates (leftmost, rightmost, topmost, bottommost) of the boundary points of the 
fuzzy shape. 
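
One possible reading of this rectangular tube is sketched below, under the assumption that the fuzzy shape is stored as a 2D list of membership values: the outer rectangle bounds every pixel with non-zero membership and the inner rectangle bounds the pixels with membership 1, so that the tube lies between the two. The function name and the return convention are hypothetical.

```python
def fuzzy_bounding_rectangle(mu):
    """Axis-parallel bounding 'tube' of a fuzzy shape given as a membership grid.

    Returns (outer, inner). Each rectangle is (min_col, min_row, max_col, max_row),
    or None when no pixel qualifies.
    """
    def bounds(keep):
        cells = [(c, r) for r, row in enumerate(mu)
                 for c, m in enumerate(row) if keep(m)]
        if not cells:
            return None
        xs = [c for c, _ in cells]
        ys = [r for _, r in cells]
        return (min(xs), min(ys), max(xs), max(ys))

    outer = bounds(lambda m: m > 0.0)   # extent of the support
    inner = bounds(lambda m: m >= 1.0)  # extent of the fully-included pixels
    return outer, inner
```

For a crisp shape the fuzzy boundary has no thickness beyond one pixel and the outer rectangle reduces to the usual minimum bounding rectangle.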



3 Operations on Fuzzy Geometry Entities 

In this section, we discuss how basic operations such as intersection, union, 
decomposition and de-fuzzification are performed on fuzzy geometry entities. 



3.1 Intersection of Fuzzy Lines, Curves, Planes, and Surfaces 

The intersection of two fuzzy lines $PQ$ and $RS$ is a fuzzy point $I$ which is 
represented by a fuzzy set $I = (i, \mu_I(i))$, where 

$\mu_I(i) = \min(\mu_{PQ}(pq), \mu_{RS}(rs))$    (1) 

Figure 5 shows the intersection point of two 2D fuzzy lines represented by a fuzzy 
set of points which are located within a quadrilateral bounded by the limiting lines 
defining fuzzy lines. For 3D case, the domain of the fuzzy set that represents the 
fuzzy point of intersection is the intersection volume of two thin tubes representing 
two fuzzy lines. 




The intersection of two fuzzy curves in 2D or 3D is a fuzzy set defined in the same 
fashion. Similarly, we can extend these concepts to cover the intersection of a fuzzy 
line and a crisp plane, or of a fuzzy line and a fuzzy plane, or of two fuzzy planes, or 
of two fuzzy surfaces. Thus, the intersection of these geometry entities can be 
performed as two separate tasks. The first task is to compute the intersection of pairs 
of crisp geometry entities (which belong to the two families of fuzzy entities) in the 
same way as in exact geometry. The second task is to compute the membership 
value for each resulting entity. 
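
The two tasks can be sketched in Python for the case of two 2D fuzzy lines. The sketch assumes that each fuzzy line is sampled as a finite family of crisp lines stored as (slope, intercept, membership) tuples, and that the membership of an intersection point is the minimum of the memberships of the two crisp lines that produce it, as for the other fuzzy intersections in this paper. The function name is hypothetical.

```python
def intersect_fuzzy_lines(family_a, family_b, eps=1e-12):
    """Intersect two fuzzy lines given as lists of (slope, intercept, membership).

    Task 1: intersect every pair of crisp lines as in exact geometry.
    Task 2: give each intersection point the minimum of the two memberships.
    Returns a list of ((x, y), membership); parallel pairs are skipped.
    """
    result = []
    for m1, c1, mu1 in family_a:
        for m2, c2, mu2 in family_b:
            if abs(m1 - m2) < eps:
                continue  # parallel crisp lines do not meet in a single point
            x = (c2 - c1) / (m1 - m2)
            y = m1 * x + c1
            result.append(((x, y), min(mu1, mu2)))
    return result
```

The returned points sample the fuzzy point of intersection; vertical lines are not representable in slope-intercept form and would need a different parameterisation.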



3.2 Union, Intersection, and Adjacency of Fuzzy Regions and Volumes 

The union of two fuzzy regions $R_1$ and $R_2$ is defined as a fuzzy subset $U$ of 
common points $u$ whose membership value for either of these regions is non-zero, 
where $U = (u, \mu_U(u))$ and $\mu_U(u) = \max(\mu_{R_1}(u), \mu_{R_2}(u))$. This formula for 
$\mu_U(u)$ is generally accepted for the union of two fuzzy sets. However, if we wish 
to model more accurately the probability of $u$ being a member of the new combined 
set, then it is more appropriate to use the following formula: 

$\mu_U(u) = 1 - (1 - \mu_{R_1}(u))(1 - \mu_{R_2}(u)) = \mu_{R_1}(u) + \mu_{R_2}(u) - \mu_{R_1}(u)\,\mu_{R_2}(u)$    (2) 

The intersection of two fuzzy regions $R_1$ and $R_2$ is defined as a fuzzy subset $J$ of 
common points $j$ whose membership values for both regions are non-zero, where 

$J = (j, \mu_J(j))$ and $\mu_J(j) = \min(\mu_{R_1}(j), \mu_{R_2}(j))$    (3) 

A common boundary of two fuzzy regions $R_1$ and $R_2$ is a special case of the 
intersection where the fuzzy subset of intersection does not include the inner points of 
either region. 

In exact geometry, two regions are adjacent if they have a common border. In the 
fuzzy case, two fuzzy regions $R_1$ and $R_2$ are adjacent if they have a fuzzy common border, that 
is, if there exists a fuzzy subset $B = (b, \mu_B(b))$ of common boundary points 
$b$ whose membership values for these two regions are non-zero, where the 
membership function is calculated as $\mu_B(b) = \min(\mu_{R_1}(b), \mu_{R_2}(b))$. 

These concepts of intersection and adjacency can be readily extended to the case of 
fuzzy volumes if we use a voxel instead of a pixel to represent a point and replace 
boundary curves by boundary surfaces. 
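
The union and intersection operators can be sketched as pointwise operations on membership grids. The sketch assumes that both regions are stored as 2D lists of equal size; the function names and the `probabilistic` flag, which switches from the maximum to the product-based union, are hypothetical.

```python
def fuzzy_union(mu_a, mu_b, probabilistic=False):
    """Pointwise union of two membership grids of equal size."""
    if probabilistic:
        def combine(a, b):
            return 1.0 - (1.0 - a) * (1.0 - b)
    else:
        combine = max
    return [[combine(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mu_a, mu_b)]


def fuzzy_intersection(mu_a, mu_b):
    """Pointwise intersection (minimum) of two membership grids of equal size."""
    return [[min(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mu_a, mu_b)]
```

For two memberships of 0.5 the maximum gives 0.5 whereas the product-based union gives 0.75, which shows how the second formula accumulates evidence from both regions.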



3.3 Decomposition of a Fuzzy Shape 

A fuzzy shape may need to be decomposed into smaller shapes according to certain 
criteria and for specific purposes. The question is how to redefine the membership 
value for each point in these new shapes. To facilitate the processing or 
understanding of a shape, a common approach is to reduce its complexity by subdividing it 
into subcomponents which have simpler shapes (e.g. convex). This case is simple 
because it is reasonable to retain the membership value of a point in the original shape 
as its membership value for the new subcomponent it belongs to. The main reason is 
that the criterion used for splitting up the shape does not have any direct effect on 
the probability of membership. However, if the criterion for decomposition is more 
complex, for example, to obtain subregions with more homogeneous texture, then the 
membership value of each point needs to be recomputed based on how closely its 
textural characteristics resemble those of the new subregion. Thus, in general, after a 
decomposition of a fuzzy shape, it is necessary to examine whether the decomposition has 
affected the membership and, if so, how it can be recomputed. Methods for 
recalculating the membership values will depend on the context of each specific 
problem. 



3.4 Defuzzification of a Fuzzy Shape 

It is advantageous to retain the notion of fuzziness in shapes as long as possible 
during a computational or decision-making process, because committing early to an 
approximation of the shapes at each stage would cause errors to 
accumulate. However, in real-life applications, there comes 
a stage where a crisp representation of a shape is required, e.g. an exact description of 
an object is required for manufacturing purposes. The question is how to derive a 
viable exact geometry entity from a fuzzy one. 

The most common defuzzification method is the centroid (or centre of gravity) of a 
fuzzy set $A$, which is calculated as its weighted mean, where 

$\bar{x} = \left(\sum_i x_i\, \mu_A(x_i)\right) / \left(\sum_i \mu_A(x_i)\right)$    (4) 

This definition is appropriate for a fuzzy point because the ‘balance’ point is 
obtained. For a fuzzy line, there are two ways to interpret this weighted mean. The 
first way is to defuzzify each fuzzy endpoint first, before joining them to obtain an 
exact line. The second way is to apply the weighted mean directly to the family of 
crisp lines which make up the fuzzy line. Each of these crisp lines is represented by 
a tuple $(m_i, c_i, \mu_i)$, where $m_i$ and $c_i$ are the slope and intercept of the line 
respectively and $\mu_i$ is the membership value of the line. Hence the weighted mean 
for the slope and intercept may be computed using the same formula to obtain a 
defuzzified line. 
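
Both interpretations rely on the same weighted mean, which can be sketched as follows. The sketch assumes that a fuzzy point is sampled as a list of ((x, y), membership) pairs and a fuzzy line as a list of (slope, intercept, membership) tuples; all function names are hypothetical.

```python
def centroid(samples):
    """Weighted mean of (value, membership) pairs."""
    total = sum(mu for _, mu in samples)
    if total == 0:
        raise ValueError("all memberships are zero")
    return sum(v * mu for v, mu in samples) / total


def defuzzify_point(points):
    """points: list of ((x, y), membership) -> crisp (x, y)."""
    return (centroid([(x, mu) for (x, _), mu in points]),
            centroid([(y, mu) for (_, y), mu in points]))


def defuzzify_line(lines):
    """lines: list of (slope, intercept, membership) -> crisp (slope, intercept)."""
    return (centroid([(m, mu) for m, _, mu in lines]),
            centroid([(c, mu) for _, c, mu in lines]))
```

The first interpretation corresponds to calling `defuzzify_point` on each fuzzy endpoint and joining the results, the second to calling `defuzzify_line` on the family of crisp lines.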

The first method can be applied to defuzzify a polynomial spline if its control 
points are known. However, if the control points are not known, the defuzzified curve 
can be computed as the weighted medial axis, where the membership value of each 
point is used as a weight for the distance calculation in the medial axis transform. 
Owing to the page limit, details of this transform are not covered here, but they may be 
found in [1, page 252]. 

For the case of a fuzzy region, once its fuzzy boundary is identified (by excluding 
the inner points and outer points of the region as explained in the previous main 
section), the boundary can be defuzzified in a similar way to that for a general curve. 



4 Conclusion 

A set of complete representation schemes for fuzzy geometry entities and basic 
operations, based on fuzzy logic, has been presented. These schemes may be 
seen as extensions of exact geometry representations. They have potential uses in 
many important applications in order to cater for the ambiguity of human reasoning 
and linguistic descriptions as well as the impreciseness of numerical computation. It 
is also envisaged that these representations will provide a foundation framework for 
fuzzy geometric modelling, where fuzziness may be found not only in geometry 
entities, but in their spatial relationships and other attributes. They would provide a 
systematic way to construct fuzzy object models for design, matching and 
recognition. However, an important issue that needs to be investigated is the trade-off 
between efficiency and the retention of fuzzy information. What type of fuzzy 
information should be incorporated in the model and at what stage should such 
information be defuzzified? Our related work on similarity measures and evolution 
of fuzzy shapes is currently in progress. 



Abstract. Deformable templates, because they contain a priori knowledge 
of the shapes being sought, can be useful in the segmentation 
of complex images including partially occluded objects. In this paper, we 
propose a generic deformable template for shapes that are built around 
a flexible symmetry axis. The shape model is based on a parametrical 
skeleton, the distance between this skeleton and the contour points of the 
shape being determined by a parametrical envelope function. Various 
examples of skeleton and envelope models are given, and a general scheme 
for identification and matching, suitable for a reusable code 
implementation, is proposed. An application to image segmentation of partially 
overlapping leaves in natural outdoor conditions is then presented. 



1 Introduction 

Because the vision process projects a 3D world onto 2D data, image 
interpretation is basically an underconstrained problem: extrinsic information 
about the shape of the objects we are looking for is often necessary to 
overcome ambiguities in the interpretation of complex scenes. Such extrinsic 
information about object shape is referred to as a shape model. Many types of shape 
models have been proposed in the literature; they can be classified into two 
main groups: 

— free-form models, or active contours: this type of model was first 
introduced by Kass et al. [2]. Free-form models integrate only local constraints 
such as smoothness (curvature limitation) and elasticity (length limitation). 
They can be represented by a set of contour sampling points [?], or in 
an analytical form, such as B-splines [3]. 

— deformable templates: these models aim to introduce stronger and more 
specific shape constraints, based on a priori knowledge of the objects of 
interest. They can be defined either analytically [?], or by a prototype 
shape with associated deformation modes derived from a training set [5]. 
Because they include strong information on shape, deformable templates have a 
better ability to deal with specific vision problems such as the recovery of 
partially occluded objects. 

Recovering objects in an image with shape models is usually seen as an 
optimisation problem, using either a Bayesian approach [?] or an 
energy minimisation approach [?]. Though some authors solve this optimisation 
problem by stochastic methods, such as simulated annealing [?], methods based 
on gradient descent are often preferred, because they are much less 
time-consuming. 

A major problem when fitting shape models to images is that the initial state 
of the model must be close enough to the expected solution to converge 
correctly. In [?], Cohen proposed a solution to this problem for free-form 
models, by introducing a “balloon force” that inflates the active contour until the 
attracting image features are reached. This method was proposed for cavity edge 
detection in medical imagery, and requires that no intermediate local minimum 
be encountered during the inflating process, which starts from inside the 
cavity. 

However, this kind of solution can be generalized to more complex images, 
under two conditions: 

— sufficiently specific shape constraints are necessary to avoid intermediate 
minima 

— suitable forces must be defined to make the model evolve in the image. 

This is the approach that we propose here, in the framework of leaf 
recognition in natural outdoor scenes. The shapes that we are looking for are 
characterized by their natural variability, and by partial overlapping situations. On 
the other hand, generic shape properties can be highlighted, and the high colour 
contrast between leaves and background allows efficient model evolution 
forces to be designed. 

The paper is organized as follows: in parts 2 and 3 we propose a generic 
type of deformable template, called a skeleton-based model, which addresses shapes 
presenting a deformable symmetry axis. We also propose associated principles 
for identification and matching which allow a reusable code implementation for 
various detailed shape definitions. The application to leaf segmentation is 
presented in part 4. 

2 Skeleton-Based Shape Models: Definition and 
Identification 

2.1 About Parametric Shape Model Identification 

We assume here that a shape in a 2D image can be described by a single closed 
contour. Under this restrictive condition, a parametric shape model can be 
considered as a function mapping a contour parameter to a contour point, 
or in the discrete case: 

$f_P : \{0, \ldots, N-1\} \to \mathbb{R}^2, \quad i \mapsto (x_i, y_i)$ 

where $P = \{p_0, \ldots, p_{K-1}\}$ is a set of $K$ parameters, and $N$ is the number of 
sampling points representing the contour. 



Newton-Based Identification Procedure. The objective of the identification 
procedure is to determine the set of parameters $P$ fitting the model to a 
given set of image points. First, it will allow us to verify whether a given parametrical 
model is pertinent for the type of shape we are interested in. Then, as we will 
see later, it can be used in the iterative process of model matching in the image. 

Let us call $X = \{M_0, M_1, \ldots, M_{N-1}\}$ a set of $N$ image points on which 
we want to fit a parametrical model $f_P$, and $F_P = \{f_P(0), \ldots, f_P(N-1)\}$ 
the current state of the model. Identification can be obtained using the Newton 
algorithm to solve the equation: 

$X - F_P = 0$.    (1) 

The Newton method to solve a multivariable equation $Y(P) = 0$ consists in 
applying to the variable $P$ the iterative correction $dP = -J^{+} Y$, where $J^{+}$ is the 
pseudo-inverse of the Jacobian matrix of $Y(P)$. In our case, this gives: 

$dP = H^{+} (X - F_P)$.    (2) 

where $H$ is the $N \times K$ Jacobian matrix: 

$H = \left[ \dfrac{\partial f_P(i)}{\partial p_0} \;\cdots\; \dfrac{\partial f_P(i)}{\partial p_{K-1}} \right]$    (3) 

The correction $dP$ given by (2) is applied to the set of parameters $P$ 
until it is nearly null. Notice that if $H$ is constant, it can be easily shown that 
the solution is obtained in one iteration. 
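
The identification loop can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the contour coordinates are flattened into a single list of scalars, the Jacobian has full column rank so that the pseudo-inverse can be obtained from the normal equations, and the names `solve`, `newton_step` and `identify` are hypothetical.

```python
def solve(a, b):
    """Solve the square system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def newton_step(model, jacobian, params, targets):
    """One correction dP = H+ (X - F_P), with H+ = (H^T H)^-1 H^T."""
    residual = [t - f for t, f in zip(targets, model(params))]
    h = jacobian(params)  # rows: samples, columns: parameters
    k = len(params)
    hth = [[sum(row[i] * row[j] for row in h) for j in range(k)] for i in range(k)]
    htr = [sum(row[i] * res for row, res in zip(h, residual)) for i in range(k)]
    return solve(hth, htr)


def identify(model, jacobian, params, targets, tol=1e-9, max_iter=50):
    """Apply corrections until dP is nearly null."""
    for _ in range(max_iter):
        dp = newton_step(model, jacobian, params, targets)
        params = [p + d for p, d in zip(params, dp)]
        if max(abs(d) for d in dp) < tol:
            break
    return params
```

For a model that is linear in its parameters, such as a parabola sampled at fixed abscissae, the Jacobian is constant and a single call to `newton_step` already returns the exact correction.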



2.2 Skeleton-Based Shape Models 

Definition. The kind of model that we introduce here addresses 2D shapes that 
present a native axial symmetry, but are subject to deformations of their 
symmetry axis. Shapes of this kind can be encountered in manufactured objects 
(e.g. pipes). They are also particularly important in agricultural objects such as 
leaves, fruits, etc., where the symmetry axis is induced by the biological 
mechanism of growth, but is subject to random deformations. Therefore, we defined a 
generic skeleton-based model, illustrated in Fig. 1, which comprises: 

— a parametric skeleton (open curve), representing the symmetry axis of the 
shape 

— a parametric envelope, which is a scalar function giving the shape contour 
distance on each side of the skeleton. 

In these conditions, each point $i$ of the model is defined by: 

$\begin{pmatrix} x(i) \\ y(i) \end{pmatrix} = \begin{pmatrix} x_S(i_S) \\ y_S(i_S) \end{pmatrix} \pm f_E(i_S) \begin{pmatrix} x_N(i_S) \\ y_N(i_S) \end{pmatrix}$    (4) 

where: 

— the number of sample points of the skeleton is $N_S = N/2 + 1$ 

— $i_S$ is the index of the corresponding skeleton point ($i_S = i$ for $i < N/2$, and $i_S = N - i$ otherwise) 

— $(x, y)$ are the contour point coordinates 

— $(x_S, y_S)$ are the skeleton point coordinates 

— $f_E$ is the envelope scalar function 

— $(x_N, y_N) = \vec{N}(i_S)$ is the vector normal to the skeleton at the point $(x_S, y_S)$ 




Fig. 1. Generic skeleton-based model 



As an example, a shape model made of a parabolic skeleton and a parabolic 
envelope could be defined by: 

$x_S(i_S) = a_x l_S^2 + b_x l_S + c_x$    (5) 

$y_S(i_S) = a_y l_S^2 + b_y l_S + c_y$    (6) 

$f_E(i_S) = b\, l_S (1 - l_S)$    (7) 

where $l_S = i_S/(N_S - 1)$, and $N_S$ is the number of skeleton samples. 
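
The construction of the contour from such a model can be sketched as follows. The sketch assumes that the skeleton is parabolic in the normalised abscissa, that the unit normal is the skeleton tangent rotated by 90 degrees, and that the positive side of the envelope is used for the first half of the contour indices and the negative side for the second half. The function name and argument order are hypothetical, and a skeleton with a vanishing tangent is not handled.

```python
import math


def skeleton_contour(ax, bx, cx, ay, by, cy, b, n):
    """Contour points of a model with parabolic skeleton and parabolic envelope.

    n is the (even) number of contour samples; the skeleton has n // 2 + 1 samples.
    """
    ns = n // 2 + 1
    contour = []
    for i in range(n):
        i_s = i if i < n // 2 else n - i          # corresponding skeleton index
        sign = 1.0 if i < n // 2 else -1.0        # side of the skeleton (assumed)
        l = i_s / (ns - 1)                        # normalised abscissa in [0, 1]
        xs = ax * l * l + bx * l + cx
        ys = ay * l * l + by * l + cy
        tx, ty = 2.0 * ax * l + bx, 2.0 * ay * l + by   # skeleton tangent
        norm = math.hypot(tx, ty)
        nx, ny = -ty / norm, tx / norm            # unit normal to the skeleton
        f = b * l * (1.0 - l)                     # envelope distance
        contour.append((xs + sign * f * nx, ys + sign * f * ny))
    return contour
```

With a straight horizontal skeleton of length 10 and `b = 4`, the contour passes through both skeleton ends and reaches a distance of 1 on each side at mid-length.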



Identification. As shown in equation (2), the identification procedure requires 
the Jacobian matrix $H$ of the parametrical model. For a skeleton-based shape 
model as defined by (4), this gives: 

$\dfrac{\partial f_P(i)}{\partial p_j} = H_S \pm \left( \dfrac{\partial f_E(i_S)}{\partial p_j}\, \vec{N}(i_S) + f_E(i_S)\, \dfrac{\partial \vec{N}(i_S)}{\partial p_j} \right)$    (8) 

where $H_S$ is the skeleton model Jacobian matrix. Therefore, the identification 
only requires the derivatives of the envelope function and of the skeleton 
normal vector, and the Jacobian matrix $H_S$. This result allows us to define, from 
a software design point of view, a generic skeleton-based model class, to which 
the identification process can be applied regardless of the particular skeleton and 
envelope model chosen: the required derivatives just have to be available as virtual 
member functions. 

Fig. 2 shows the object hierarchy that we have developed, with various 
derived classes of skeletons and envelopes. 
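
The design idea can be sketched for the envelope part of such a hierarchy, with Python abstract methods playing the role of virtual member functions. This is a hypothetical illustration and not the class hierarchy of the figure: the class names, the method names and in particular the formula used for the elliptic envelope are assumptions.

```python
import math
from abc import ABC, abstractmethod


class Envelope(ABC):
    """Scalar envelope f_E(l), l in [0, 1] being the normalised skeleton abscissa."""

    def __init__(self, params):
        self.params = list(params)

    @abstractmethod
    def value(self, l):
        """Contour distance f_E(l)."""

    @abstractmethod
    def gradient(self, l):
        """Partial derivatives of f_E(l) with respect to each envelope parameter."""


class ParabolicEnvelope(Envelope):
    def value(self, l):
        return self.params[0] * l * (1.0 - l)

    def gradient(self, l):
        return [l * (1.0 - l)]


class EllipticEnvelope(Envelope):
    # Assumed formula: half-width of an ellipse spanning the whole skeleton.
    def _shape(self, l):
        return math.sqrt(max(0.0, 1.0 - (2.0 * l - 1.0) ** 2))

    def value(self, l):
        return self.params[0] * self._shape(l)

    def gradient(self, l):
        return [self._shape(l)]


def envelope_jacobian(envelope, n_samples):
    """Generic code: relies only on the Envelope interface, not on the concrete class."""
    return [envelope.gradient(i / (n_samples - 1)) for i in range(n_samples)]
```

The function `envelope_jacobian` never inspects the concrete class, so a new envelope type only has to provide `value` and `gradient` to be usable by the generic identification code.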




Fig. 2. Object-oriented hierarchy of skeleton-based models (bold arrows show inheri- 
tance) 



3 Shape Model Matching 

3.1 Pressure Forces 

The more strongly a shape model is constrained, the better its ability 
to bypass undesired local minima, and thus to start the matching process from 
an initial state far from the final target position, and even to manage object 
overlapping situations. However, for this purpose, attraction forces directly 
derived from an image energy function are not sufficient: evolution forces that 
control the model modification process, and thus depend on the current 
shape of the model itself, are required. We will call them “pressure forces”, by 
extension of those previously defined in [?]. As an example, in order to match 
the model to a binary shape (white on black), we can define a pressure force 
which aims to make the model expand or retract according to the value 
of the pixels under the model. Thus, for each point $f_P(i)$ of the model contour, 
this force will have: 

— the direction of the vector normal to the contour at this point 

— a positive intensity if the pixel at this point is white, negative otherwise (Fig. 4). 

As we will see in part 4, such pressure forces allow the model to start from 
a very small portion of the object candidate, provided that the principal axis of the 
initial model is correctly chosen. 



Fig. 3. Identification examples with a parabolic skeleton and various envelope models 
(from left to right: parabolic, pear-shaped and elliptic envelopes) 




Fig. 4. Pressure forces for a binary image 



3.2 Iterative Procedure 

Once the evolution forces are defined, they have to be applied iteratively in order 
to modify the shape model parameters until the final position is reached. This 
is done by the following procedure: 

— pressure forces are translated into model-free elementary displacements 
$\Delta X_i = k F_i$ (where $k$ is a constant) 

— these elementary displacements define a new target contour 

— an identification is made on this target contour, as described in part 2. 
Notice that in this case, only one iteration of equation (2) is used in the 
identification step, thanks to a very close target contour. 
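
The three steps of this procedure can be sketched on a deliberately simple stand-in model. The sketch below does not use the skeleton-based model: it assumes a circle with parameters (centre, radius), which is linear in its parameters so that the single identification step has a closed form, and binary pressure forces of unit intensity. The function names and default values are hypothetical.

```python
import math


def pressure_force(image, x, y):
    """+1 if the pixel under (x, y) is white (1), -1 otherwise or outside the image."""
    r, c = int(round(y)), int(round(x))
    inside = 0 <= r < len(image) and 0 <= c < len(image[0])
    return 1.0 if inside and image[r][c] else -1.0


def match_circle(image, cx, cy, radius, k=0.5, n=64, iterations=200):
    """Iteratively match a circle (cx, cy, radius) to a white blob in a binary image."""
    angles = [2.0 * math.pi * i / n for i in range(n)]
    for _ in range(iterations):
        targets = []
        for a in angles:
            ux, uy = math.cos(a), math.sin(a)        # outward normal of the circle
            x, y = cx + radius * ux, cy + radius * uy
            d = k * pressure_force(image, x, y)      # step 1: elementary displacement
            targets.append((x + d * ux, y + d * uy)) # step 2: new target contour
        # Step 3: identification on the target contour. The circle is linear in
        # its parameters, so one least-squares step solves it exactly.
        cx = sum(x for x, _ in targets) / n
        cy = sum(y for _, y in targets) / n
        radius = sum(x * math.cos(a) + y * math.sin(a)
                     for (x, y), a in zip(targets, angles)) / n
    return cx, cy, radius
```

Started from a small circle placed anywhere inside a white disc, the model inflates until part of its contour leaves the disc, then recentres itself and oscillates around the disc boundary with an amplitude of the order of `k`.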

4 Application to Weed Leaf Segmentation 

The application that we present here is concerned with the recognition and the 
characterisation of weed populations in crops by computer vision, for agronomi- 
cal purposes (optimisation of herbicide usage) . In this framework, shape models 
have been considered in order to overcome the image segmentation problems, 
which are mainly due to the biological variability of plants and overlapping 
leaves situations. 

We present here the first results that have been obtained with skeleton- 
based models for a very common weed variety in maize crop, called green foxtail 
(Setaria viridis). The model that has been chosen for this variety is based on a 
parabolic skeleton and a parabolic envelope, as defined in 2.2. 



4.1 Definition of the Pressure Forces 

As described in 3.1, pressure forces are normal to the model contour. Pressure 
force intensities are computed from color information in the images. As a first 
step, the mean value $\mu$ and the covariance matrix $C$ of the RGB values of leaf 
pixels have been determined. This allows us to calculate, for each pixel of the image 
with a color value $x = (R, G, B)$, the Mahalanobis distance $d$ from the plant color 
class by: 



$$d^2 = (x - \mu)^T C^{-1} (x - \mu). \qquad (9)$$ 

Then a threshold value $s$ is applied to the squared distance $d^2$ to perform a 
binary segmentation, which will be used for the model initialisation. However, the 
intensities of the pressure forces are not simple binary values as suggested in 3.1. 
They are given in the neighbourhood of $s$ by the relation $F_i = s - d_i^2$ (see Fig. 5). 
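
As a minimal sketch of the computation just described, the squared Mahalanobis distance and the resulting force intensity could be evaluated per pixel as below. The helper names (`inverse_3x3`, `squared_mahalanobis`, `pressure_intensity`) and the sample statistics in the usage note are hypothetical; only the formulas $d^2 = (x - \mu)^T C^{-1} (x - \mu)$ and $F_i = s - d_i^2$ come from the text.

```python
def inverse_3x3(m):
    """Inverse of a 3x3 matrix through its adjugate (assumes a non-zero determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def squared_mahalanobis(x, mean, cov_inv):
    """d^2 = (x - mean)^T C^-1 (x - mean) for one RGB value x."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return sum(diff[r] * cov_inv[r][c] * diff[c]
               for r in range(3) for c in range(3))

def pressure_intensity(x, mean, cov_inv, s):
    """Force intensity F = s - d^2: positive inside the plant colour class."""
    return s - squared_mahalanobis(x, mean, cov_inv)
```

With made-up statistics such as `mean = (50, 120, 40)` and a diagonal covariance, a pixel at the class mean receives the maximum intensity `s`, and the intensity changes sign where $d^2$ crosses the threshold.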

4.2 Model Initialisation 

Models are initialised by searching for the leaf tips in the binary image issued 
from color segmentation. Whenever a leaf tip is found (i.e. a small leaf-type win- 
dow surrounded by a majority of background pixels), the best model orientation 
is determined by rotating a small elliptic model around the centre of the window, 
and keeping the position most included in vegetation (see Fig. 6). 






G. Rabatel et al. 




Fig. 5. Intensity of color pressure forces 




Fig. 6. Model initialisation procedure 



4.3 Model Evolution 



The evolution process has been described in 3.2. However, experiments have 
shown that some undesirable deformations still occur: the sample points of 
the model can cluster on some parts of the contour, creating irregular behaviours 
which have to be controlled. In [?], Berger proposed to resolve this problem for 
snakes by adding a term to their internal energy, which attempts to maintain 
equidistance between the points along the curve. In our case, we applied this 
supplementary constraint to the skeleton of the model, by adding corresponding 
forces on each contour point. An example of model evolution is given in Fig. 7. 




Iteration 0 Iteration 20 Iteration 110 



Fig. 7. Model evolution on a weed leaf 
The termination criterion is based on the skeleton speed of evolution, which 
decreases when the model approaches its final position. 



4.4 Results 

Forty images of green foxtail have been processed, representing about 600 leaves. 
These images were acquired using a photographic flash, in order to minimise the 
effect of the variability of natural outdoor lighting conditions. For each image, 
models have been initialised on every leaf tip found and then expanded as 
described above. Because several tips can in some cases be detected for the same 
leaf, redundant resulting models have then been searched for and removed. 

Under these conditions, 83.6% of the leaves were detected. Leaves that are 
not detected (16.4%) correspond to occluded tips. In order to measure the ad- 
justment quality of models in their final position, we defined a criterion based on 
the similarity, for each point of the model, between the direction of the gradient 
of colorimetric distance $d$ and the normal to the model: this allows us to check whether 
the model contour is actually aligned on a color edge. The quality criterion is 
then the percentage of points of the model verifying this similarity, defined by a 
maximum angle $\theta$. A value of 40° was selected as the best setting. 
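
The criterion can be sketched as below, assuming the gradient of the colorimetric distance and the contour normal have already been sampled at each model point. The function name `quality_criterion` and the handling of zero-length vectors (counted as not aligned) are assumptions made for this illustration.

```python
import math

def quality_criterion(gradients, normals, max_angle_deg=40.0):
    """Percentage of model points whose gradient of d lies within
    max_angle_deg of the contour normal at that point."""
    aligned = 0
    for (gx, gy), (nx, ny) in zip(gradients, normals):
        g, n = math.hypot(gx, gy), math.hypot(nx, ny)
        if g == 0.0 or n == 0.0:
            continue  # no usable direction: treated as not aligned (assumption)
        cos_angle = max(-1.0, min(1.0, (gx * nx + gy * ny) / (g * n)))
        if math.degrees(math.acos(cos_angle)) <= max_angle_deg:
            aligned += 1
    return 100.0 * aligned / len(normals)
```

For instance, with four points whose gradients make angles of 0°, 45°, 90° and 180° with the normal, only the first passes the 40° test, giving a criterion of 25%.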

Figure 8 shows an example of detected leaves and the corresponding quality 
criterion values. We can see that well-adapted models have a high criterion value. 
Much lower criterion values (less than 80%) correspond not only to erroneous 
models, but also to well-adapted models, that are placed on partially occluded 
leaves. As a consequence, if the quality criterion is sufficiently high, models 
are assumed to be well-adapted. Otherwise, the quality criterion is not reliable 
enough to draw any conclusion on the adjustment. 

In our case, 34.3% of the resulting models have a quality criterion above 
80%. For those models, we can assume that they are well-adjusted on leaves. 
Concerning the others, as the quality criterion is not reliable enough, the next 
step will consist in carrying on with the analysis of the model. In particular, 
the study of the spatial relative position between each model should help to get 
information on the adjustment of the models, as it would be possible to interpret 
models not individually but at a plant scale. 

5 Conclusion 

We have introduced here a particular type of deformable shape model, based on 
a parametrical symmetry axis and a parametrical envelope, and shown that this 
concept was flexible enough to easily support various specific implementations 
for various shape modelisations, with the same basic procedures. Because models 
of this kind contain strong internal constraints, they can be fitted in images 
starting from distant initial positions, using non-gradient evolution forces. An 
application to the detection of partially occluded vegetation leaves has 
been presented. Results show that a satisfactory rate of leaves can be correctly 
detected. However, in this particular case, the quality criterion that has been 

Fig. 8. Example of results 



defined is not reliable enough to isolate erroneous detections. Further studies 
are necessary to overcome this problem by analysing relative position of leaves 
with respect to the possible plant structures. 
Abstract. This paper presents a geometric measure that can be used 
to gauge the similarity of 2D shapes by comparing their skeletons. The 
measure is defined to be the rate of change of boundary length with 
distance along the skeleton. We demonstrate that this measure varies 
continuously when the shape undergoes deformations. Moreover, we show 
that ligatures are associated with low values of the shape-measure. The 
measure provides a natural way of overcoming a number of problems 
associated with the structural representation of skeletons. The first of 
these is that it allows us to distinguish between perceptually distinct 
shapes whose skeletons are ambiguous. Second, it allows us to distinguish 
between the main skeletal structure and its ligatures, which may be the 
result of local shape irregularities or noise. 



1 Introduction 

The skeletal abstraction of 2D and 3D objects has proved to be an alluring yet 
highly elusive goal for over 30 years in shape analysis. The topic is not only 
important in image analysis, where it has stimulated a number of important 
developments including the medial axis transform and iterative morphological 
thinning operators, but is also an important field of investigation in differential 
geometry and biometrics, where it has led to the study of the so-called Blum 
skeleton [?]. Because of this, the quest for reliable and efficient ways of computing 
skeletal shape descriptors has been a topic of sustained activity. Recently, there 
has been a renewed research interest in the topic, which has been aimed at deriving 
a richer description of the differential structure of the object boundary. This 
literature has focused on the so-called shock-structure of the reaction-diffusion 
equation for object boundaries. 

The idea of characterising boundary shape using the differential singularities 
of the reaction equation was first introduced into the computer vision literature 
by Kimia, Tannenbaum and Zucker [?]. The idea is to evolve the boundary 
of an object to a canonical skeletal form using the reaction-diffusion equation. 
The skeleton represents the singularities in the curve evolution, where inward 
moving boundaries collide. The reaction component of the boundary motion 
corresponds to morphological erosion of the boundary, while the diffusion com- 
ponent introduces curvature dependent boundary smoothing. In practice, the 
skeleton can be computed in a number of ways [?]. Recently, Siddiqi, Tannenbaum 
and Zucker have shown how the eikonal equation which underpins the 
reaction-diffusion analysis can be solved using the Hamilton-Jacobi formalism of 
classical mechanics [?]. 

One of the criticisms that can be levelled at existing skeletonisation methods 
is their sensitivity to small boundary deformations or ligatures. Although these 
can be reduced via curvature dependent smoothing, they may have a significant 
effect on the topology of the extracted skeleton. 

Once the skeletal representation is to hand, shapes may be matched by 
comparing their skeletons. Most of the work reported in the literature adopts 
a structural approach to the matching problem. For instance, Pelillo, Siddiqi 
and Zucker use a sub-tree matching method [?]. This method is potentially 
vulnerable to structural variations or errors due to local deformations, ligature 
instabilities or other boundary noise. Tirthapura, Kimia and Klein have a potentially 
more robust method which matches by minimising graph-edit distance [?]. 

One of the criticisms of these structural matching methods is that perceptually 
distinct shapes may have topologically identical skeletons which cannot 
be distinguished from one another. Moreover, small boundary deformations may 
significantly distort the topology of the skeleton. 

We draw two observations from this review of the related literature. The first 
is that the existing methods for matching are based on largely structural repre- 
sentations of the skeleton. As a result, shapes which are perceptually different but 
which give rise to the same skeleton topology are ambiguous with one-another. 
For this reason we would like to develop a metrical representation which can 
be used to assess the differences in shape for objects which have topologically 
identical skeletons. Secondly, we would also like to be able to make compar- 
isons between shapes that are perceptually close, but whose skeletons exhibit 
topological differences due to small but critical local shape deformations. 

To meet these dual goals, our shape-measure must have three properties. 
First, it must be continuous over local regions in shape-space in which there are 
no topological transitions. If this is the case then it can be used to differentiate 
shapes with topologically identical skeletons. Secondly, it must vary smoothly 
across topological transitions. This is perhaps the most important property since 
it allows us to define distances across transitions in skeleton topology. In other 
words, we can traverse the skeleton without encountering singularities. Thirdly, 
it must distinguish between the principal component of the skeleton and its 
ligatures [?]. This will allow us to suppress instabilities due to local shape 
deformations. 

Commencing from these observations, we opt to use a shape-measure based 
on the rate of change of boundary length with distance along the skeleton. To 
compute the measure we construct the osculating circle to the two nearest bound- 
ary points at each location on the skeleton. The rate of change of boundary length 
with distance along the skeleton is computed by taking neighbouring points on 
the skeleton. The corresponding change in boundary length is computed by 
determining distance along the boundary between the corresponding points of 
contact for the two osculating circles. The boundary distances are averaged for the 
boundary segments either side of the skeleton. 

This measurement has previously been used in the literature to express the 
relevance of a branch when extracting or pruning the skeleton [?]. We show 
that the rate of change of boundary length with distance along the skeleton has a 
number of interesting properties. The consequence of these properties is that 
the descriptive content of the measure extends beyond simple feature saliency, 
and can be used to attribute the relational structure of the skeleton to achieve a 
richer description of shape. Furthermore, we demonstrate that there is an intimate 
relationship between the shape measure and the divergence of the distance 
map. This is an important observation since the divergence plays a central role 
when the skeleton is computed using the Hamilton-Jacobi formalism to solve the 
eikonal equation. 



2 Skeleton Detection 

A great number of papers have been written on the subject of skeleton detection. 
The problem is a tricky one because it is based on the detection of singularities 
in the evolution of the eikonal equation on the boundary of the shape. 

The eikonal equation is a partial differential equation that governs the motion 
of a wave-front through a medium. In the case of a uniform medium the equation 
is $\frac{\partial}{\partial t} C(t) = \alpha N(t)$, where $C(t) : [0, s] \to \mathbb{R}^2$ is the equation of the front at 
time $t$, $N(t) : [0, s] \to \mathbb{R}^2$ is the equation of the normal to the wave front 
in the direction of motion, and $\alpha$ is the propagation speed. As the wave front 
evolves, opposing segments of the wave-front collide, generating a singularity. 
This singularity is called a shock and the set of all such shocks is the skeleton 
of the boundary defined by the original curve. This realisation of the eikonal 
equation is also referred to as the reaction equation. 

To detect the singularities in the eikonal equation we use the Hamilton-Jacobi 
approach presented by Siddiqi, Tannenbaum, and Zucker [?]. Here we review 
this approach. 

We commence by defining a distance-map that assigns to each point on the 
interior of an object the closest distance $D$ from the point to the boundary (i.e. 
the distance to the closest point on the object boundary). The gradient of this 
distance-map defines a field $F$ whose domain is the interior of the shape. The 
field is defined to be $F = \nabla D$, where $\nabla = (\partial/\partial x, \partial/\partial y)^T$ is the gradient operator. 
The trajectory followed by each boundary point under the eikonal equation can 
be described by the ordinary differential equation $\dot{x} = F(x)$, where $x$ is the 
coordinate vector of the point. This is a Hamiltonian system, i.e. wherever the 
trajectory is defined the divergence of the field is zero [?]. However, the total 
inward flux through the whole shape is non-zero. In fact, the flux is proportional 
to the length of the boundary. 

The divergence theorem states that the integral of the divergence of a vector 
field over an area is equal to the flux of the vector field over the enclosing 
boundary of that area. In our case, $\int_A \nabla \cdot F \, da = \int_L F \cdot n \, dl = \Phi_A(F)$, where $A$ 
is any area, $F$ is a field defined in $A$, $da$ is the area differential in $A$, $dl$ is the 
length differential on the border $L$ of $A$, $n$ is the outward unit normal to $L$, and 
$\Phi_A(F)$ is the outward flux of $F$ through the border $L$. 

By virtue of the divergence theorem we have that, within the interior, there 
are points where the system is not conservative. The non-conservative points 
are those where the boundary trajectory is not well defined, i.e. where there 
are singularities in the evolution of the boundary. These points are the so-called 
shocks or skeleton of the shape-boundary. Shocks are thus characterised by locations 
where $\nabla \cdot F < 0$. Unfortunately, skeletal points are also ridges of the 
distance map $D$, that is, $F = \nabla D$ is not uniquely defined at those points, but 
has different values on opposite sides of the watershed. This means that the 
calculation of the derivatives of $F$ gives rise to numerical instabilities. To avoid 
this problem we can use the divergence theorem again. We approximate the 
divergence with the outward flux through a small area surrounding the point. 
That is, $\nabla \cdot F(x) \approx \Phi_U(F)(x)$, where $U$ is a small area containing $x$. Thus, 
by calculating the flux through the immediate neighbors of each pixel we obtain a 
suitable approximation of $\nabla \cdot F(x)$. 
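
The flux approximation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: the distance map is computed by brute force, the gradient by central differences, and the flux at a pixel is taken as the average of $F \cdot n$ over its 8 neighbours, with $n$ pointing away from the pixel. All function names are hypothetical.

```python
import math

def distance_map(mask):
    """Brute-force Euclidean distance from each foreground pixel (1)
    to the nearest background pixel (0)."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    return dist

def gradient(dist):
    """Central-difference gradient (d/dr, d/dc) of the distance map."""
    h, w = len(dist), len(dist[0])
    grad = [[(0.0, 0.0)] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            grad[r][c] = ((dist[r + 1][c] - dist[r - 1][c]) / 2.0,
                          (dist[r][c + 1] - dist[r][c - 1]) / 2.0)
    return grad

NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def outward_flux(grad, r, c):
    """Average of F . n over the 8 neighbours of (r, c), n pointing away from it."""
    total = 0.0
    for dr, dc in NEIGHBOURS:
        fr, fc = grad[r + dr][c + dc]
        total += (fr * dr + fc * dc) / math.hypot(dr, dc)
    return total / len(NEIGHBOURS)
```

On a long filled rectangle, this estimate is clearly negative on the central row, where opposing fronts meet, and vanishes at interior points away from it, which is the behaviour the thinning procedure relies on.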



2.1 Locating the Skeleton 

The thinning of the points enclosed within the boundary to extract the skeleton 
is an iterative process which involves eliminating points with low inward flux. 
The steps in the thinning and localisation of the skeleton are as follows: 

— At each iteration of the thinning process we have a set of points that are 
candidates for elimination. We remove from this set the point with the lowest 
inward flux. 

— Next, we check whether the point is topologically simple, i.e. whether it 
can be eliminated without splitting the remaining point-set. 

— If the point is not simple, then it must be part of the skeleton. Thus we 
retain it. 

— If the point is simple, then we check whether it is an endpoint. If the point 
is simple and not an endpoint, then we eliminate it from the image. In this 
case we add to the candidate set the points in its 8-neighborhood 
that are still part of the thinned shape (i.e. points that were not previously 
eliminated). 

— If a simple point is also an endpoint, then the decision of whether or not it 
will be eliminated is based on the inward flux value. If the flux value is below 
a certain threshold we eliminate the point in the manner described above. 
Otherwise we retain the point as part of the skeleton. 

We initialise this iterative process by placing every boundary point in the can- 
didate set. We iterate the process until we have no more candidates for removal. 
The residual points will all belong to the skeleton. 
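
The steps above can be sketched with a priority queue ordered by inward flux. Several choices here are assumptions made for the illustration and are not taken from the paper: the simple-point test uses the Yokoi 8-connectivity number, an endpoint is taken to be a pixel with exactly one remaining 8-neighbour, and the names `RING`, `is_simple`, `is_endpoint` and `thin` are hypothetical.

```python
import heapq

# 8-neighbour offsets (row, col), counter-clockwise starting from east.
RING = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def is_simple(shape, p):
    """True when the Yokoi 8-connectivity number of p equals 1,
    i.e. removing p does not change the topology of the shape."""
    r, c = p
    bg = [0 if (r + dr, c + dc) in shape else 1 for dr, dc in RING]  # 1 = background
    number = 0
    for k in (0, 2, 4, 6):
        number += bg[k] - bg[k] * bg[(k + 1) % 8] * bg[(k + 2) % 8]
    return number == 1

def is_endpoint(shape, p):
    """Simplified endpoint test: exactly one 8-neighbour left in the shape."""
    r, c = p
    return sum((r + dr, c + dc) in shape for dr, dc in RING) == 1

def thin(pixels, inward_flux, threshold):
    """Flux-ordered thinning; returns the set of retained (skeleton) pixels."""
    shape = set(pixels)
    # Initial candidates: pixels with at least one background neighbour.
    heap = [(inward_flux[p], p) for p in shape
            if any((p[0] + dr, p[1] + dc) not in shape for dr, dc in RING)]
    heapq.heapify(heap)
    while heap:
        flux, p = heapq.heappop(heap)  # candidate with the lowest inward flux
        if p not in shape or not is_simple(shape, p):
            continue                   # already removed, or needed for topology
        if is_endpoint(shape, p) and flux >= threshold:
            continue                   # salient endpoint: keep it
        shape.remove(p)
        for dr, dc in RING:            # surviving neighbours become candidates
            q = (p[0] + dr, p[1] + dc)
            if q in shape:
                heapq.heappush(heap, (inward_flux[q], q))
    return shape
```

As a toy usage, thinning a 3 × 7 rectangle whose central row carries an inward flux of 1 (and 0 elsewhere) with a threshold of 0.5 leaves exactly the central row.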
3 The Shape-Measure and Its Properties 

When the skeleton is computed in this way, then the eikonal equation induces a 
map from a point in the skeleton to a set of points on the boundary of the shape. 
That is, there is a correspondence between a point on the skeleton and the set of 
points on the boundary whose trajectories intercept it under the motion induced 
by the eikonal equation. The cardinality of this set of corresponding points on 
the boundary can be used to classify the local topology of the skeleton in the 
following manner 

— the cardinality is greater than or equal to 3 for junctions. 

— for endpoints, the cardinality ranges from 1 to a continuum. 

— for the general case of points on branches of the skeleton, the cardinality is 
exactly 2. 

As a result of this final property, any segment of a 
skeleton branch $s$ is in correspondence with two boundary 
segments $l_1$ and $l_2$. This allows us to assign to a 
portion of the skeleton the portion of the boundary 
from which it arose. For each internal point in a skeleton 
branch, we can thus define the local ratio between 
the length of the generating boundary segment and the 
length of the generated skeleton segment. The rate of change of boundary length 
with skeleton length is defined to be $dl/ds = dl_1/ds + dl_2/ds$. This ratio is our 
measure of the relevance of a skeleton segment in the representation of the 2D 
shape-boundary. 

Our proposal in this paper is to use this ratio as a measure 
of the local relevance of the skeleton to the boundary-shape 
description. In particular we are interested in using the measure 
to identify ligatures [2]. Ligatures are skeleton segments 
that link the logically separate components of a shape. They 
are characterised by a high negative curvature on the generat- 
ing boundary segment. The observation which motivates this 
proposal is that we can identify ligature by attaching to each 
infinitesimal segment of skeleton the length of the boundary 
that generated it. Under the eikonal equation, a boundary 
segment with high negative curvature produces a rarefaction 
front. This front will cause small segments to grow in length 
throughout their evolution, until they collide with another 
front and give rise to a so-called shock. This means that very short boundary 
segments generate very long skeleton branches. Consequently, when a skeleton 
branch is a ligature, there is an associated decrease in the boundary-length 
to shock-length ratio. As a result our proposed skeletal shape measure “weights” 
ligatures less than other points in the same skeleton branch. 

To better understand the rate of decrease of the boundary length with skeletal 
length, we investigate its relationship to the local geometry of the osculating 




Fig. 2. Ligature 
circle to the object boundary. We have 

$$\frac{dl_1}{ds} = \frac{\cos\theta}{1 - r k_1} \quad \text{and, similarly,} \quad \frac{dl_2}{ds} = \frac{\cos\theta}{1 - r k_2} \qquad (1)$$ 



where $r$ is the radius of the osculating circle, $k_i$ is the curvature of the mapped 
segment on the boundary, oriented so that positive curvatures imply the osculating 
circle is in the interior of the shape, and, finally, $\theta$ is the angle between 
the tangent to the skeleton and the tangent to the corresponding point on the 
boundary. These formulae show that the metric decreases as the curvature becomes 
more negative and as the radius grows. That is, if we fix a negative curvature $k_1$, the measure 
decreases as the skeleton gets further away from the border. Furthermore, the 
measure decreases faster when the curvature becomes more negative. 

Another important property of the shape-measure is that its value varies 
smoothly across shape deformations, even when these deformations impose topo- 
logical transitions to the skeleton. To demonstrate this property we make use of 
the taxonomy of topological transitions of the skeleton compiled by Giblin and 
Kimia [?]. According to this taxonomy, a smooth deformation of the shape 
induces only two types of transition on the skeleton (plus their time reversals). The 
transitions are branch contraction and branch splicing. A deformation contracts 
a branch joining two junctions when it moves the junctions together. Conversely, 
it splices a branch when it reduces in size, smoothes out, or otherwise eliminates 
the protrusion or sub-part of a shape that generates the branch. 

A deformation that contracts or splices a skeleton branch causes the global 
value of the shape-measure along the branch to go to zero as the deformation 
approaches the topological transition. This means that a decreasing length of 
boundary generates the branch, until the branch disappears altogether. 

When a deformation causes a contraction transition, both the length of the 
skeleton branch and the length of the boundary segments that generate the 
branch go to zero. A more elusive case is that of splicing. Through a splicing 
deformation, a decreasing length of boundary maps to the skeleton branch. This 
is because either the skeleton length and its associated boundary length are both 
reduced, or because the deformation allows boundary points to be mapped to 
adjacent skeleton branches. For this reduction in the length of the generating 
boundary, we do not have a corresponding reduction of the length of the skeleton 
branch. In fact, in a splice operation the length of the skeleton branch is a lower 
bound imposed by the presence of the ligature. This is the major cause of the 
perceived instability of the skeletal representation. Weighting each point on the 
boundary which gave rise to a particular skeleton branch allows us to eliminate 
the contributions from ligatures, thus smoothing the instability. Since a smooth 
shape deformation induces a smooth change in the boundary, the total shape- 
measure along the branch has to vary smoothly through any deformation. 

Just like the radius of the osculating circle, key shape elements such as necks 
and seeds are associated with local variations of the length ratio. For instance, 
a neck is a point of high rarefaction and, thus, a minimum of the shape-measure 
along the branch. A seed is a point where the front of the evolution of the eikonal 
equation concentrates, and so is characterised by a maximum of the ratio. 
Another important property of the shape-measure is its invariance to bending 
of the shape. Bending invariance derives from the fact that, if we bend the shape, 
we lose from one side the same amount of boundary-length that we gain on the 
opposite side. To see this we let $k$ be the curvature on the skeleton, and $k_1$ 
and $k_2$ be the inward curvature on the corresponding boundary points. Further, 
suppose that $\theta$ is the angle between the border tangent and the skeleton tangent. 

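Let $p = 2k + \frac{\cos\theta}{1 - r k_1} k_1$. Then $p = k_2(\cos\theta + pr)$ and 

$$k_2 = \frac{p}{\cos\theta + pr} = \frac{2k + \frac{\cos\theta}{1 - r k_1} k_1}{\cos\theta + 2rk + r \frac{\cos\theta}{1 - r k_1} k_1} = \frac{2k(1 - r k_1) + k_1 \cos\theta}{2rk(1 - r k_1) + \cos\theta}.$$ 

Substituting the above in $dl_2/ds$, we have 

$$\frac{dl_2}{ds} = \frac{\cos\theta}{1 - r \, \frac{2k(1 - r k_1) + k_1 \cos\theta}{2rk(1 - r k_1) + \cos\theta}} = \frac{2rk(1 - r k_1) + \cos\theta}{1 - r k_1}.$$ 

Thus we find that $dl_2/ds = 2rk + dl_1/ds$, or $dl_2/ds - rk = dl_1/ds + rk$. That 
is, if we bend the image enough to cause a curvature $k$ in the skeleton, what we 
lose on one side we get back on the other. 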
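
The bending relation can be checked numerically. The sketch below evaluates equation (1) and the expression for $k_2$ given above for arbitrary sample values; the function names and the sample numbers are assumptions chosen only for illustration.

```python
import math

def boundary_rate(theta, r, k_boundary):
    """Equation (1): dl_i/ds = cos(theta) / (1 - r * k_i)."""
    return math.cos(theta) / (1.0 - r * k_boundary)

def opposite_curvature(k, k1, r, theta):
    """Curvature k_2 on the opposite boundary side for a skeleton of
    curvature k, following the expression derived in the text."""
    numerator = 2.0 * k * (1.0 - r * k1) + k1 * math.cos(theta)
    denominator = 2.0 * r * k * (1.0 - r * k1) + math.cos(theta)
    return numerator / denominator

# Sample values (arbitrary): skeleton curvature, boundary curvature, radius, angle.
k, k1, r, theta = 0.1, 0.2, 1.0, 0.3
rate1 = boundary_rate(theta, r, k1)
rate2 = boundary_rate(theta, r, opposite_curvature(k, k1, r, theta))
# Bending identity: dl_2/ds - r*k equals dl_1/ds + r*k.
gap = (rate2 - r * k) - (rate1 + r * k)
```

Here `gap` is zero up to rounding error. The same `boundary_rate` function also illustrates the ligature behaviour: for a fixed negative curvature, its value shrinks as `r` grows.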



4 Measure Extraction 

The extraction of the skeletal shape measure is a natural by-product which comes 
for free when we use the Hamilton- Jacobi approach for skeleton extraction. This 
is a very important property of this shape-measure. Using the divergence theorem 
we can transport a quantity linked to a potentially distant border to a quantity 
local to the skeleton. Using this property, we can prove that the border length 
to shock length ratio is proportional to the divergence of the gradient of the 
distance map. 

As we have already mentioned, the eikonal equation induces a system that 
is conservative everywhere except on the skeleton. That is, given the field F 
defined as the gradient of the distance map, the divergence of F is everywhere 
zero, except on the skeleton. 

To show how the shape-measure can be computed in the Hamilton-Jacobi 
setting, we consider a skeleton segment $s$ and its $\epsilon$-envelope. The segment $s$ 
maps to two border segments $l_1$ and $l_2$. The evolution of the points in these 
border segments defines two areas $A_1$ and $A_2$ enclosed within the $\epsilon$-envelope of 
$s$, the segments of boundary $l_1$ and $l_2$, and the trajectories $b_1^1$, $b_1^2$ and 
$b_2^1$, $b_2^2$ of the endpoints of $l_1$ and $l_2$. The geometry of these areas is illustrated 
in Figure 3. 

Since $\nabla \cdot F = 0$ everywhere in $A_1$ and $A_2$, by virtue of the divergence theorem 
we can state that the flux from each of the two areas is zero, i.e. $\Phi_{A_1}(F) = 0$ and 
$\Phi_{A_2}(F) = 0$. The trajectories of the endpoints of the border are, by construction, 
parallel to the field, so their normal is everywhere normal to the field and thus 
there is no flux through the segments $b_1^1$, $b_1^2$, $b_2^1$ and $b_2^2$. On the other hand the 
field on the shape-boundary is always normal to the boundary. Hence, the flux 
through the border segments $l_1$ and $l_2$ is equal to the lengths $\ell(l_1)$ and $\ell(l_2)$ of 
the segments $l_1$ and $l_2$ respectively. 

Since $\Phi_{A_1}(F) = 0$ and $\Phi_{A_2}(F) = 0$, the flux that enters 
through the border segments $l_1$ and $l_2$ has to exit through 
the $\epsilon$-envelope of $s$. That is, if $e_1$ and $e_2$ are the sides of $A_1$ 
and $A_2$ on the $\epsilon$-envelope of $s$, we have $\Phi_{e_1}(F) = \Phi_{l_1}(F)$ and 
$\Phi_{e_2}(F) = \Phi_{l_2}(F)$. This, in turn, implies that the flux through 
the whole $\epsilon$-envelope of $s$ is $\Phi_\epsilon(F) = \ell(l_1) + \ell(l_2)$. 

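Since $\lim_{\epsilon \to 0} \int_\epsilon \nabla \cdot F \, d\epsilon = \int_s \nabla \cdot F \, ds$, and the value of 
the flux through the $\epsilon$-envelope of $s$ is independent of $\epsilon$, we 
have $\int_s \nabla \cdot F \, ds = \ell(l_1) + \ell(l_2)$. 

Taking the first derivative with respect to $ds$ we have, for 
each non-singular point in the skeleton, $\nabla \cdot F = dl_1/ds + dl_2/ds$. 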

4.1 Computing the Distance between Skeletons 

This result allows us to calculate a global shape-measure for each skeleton branch 
during the branch extraction process. For our matching experiments we have 
used a simple graph representation where the nodes are junctions or endpoints, 
and the edges are branches of the skeleton. When we have completed the thinning 
of the shape boundary and we are left only with the skeleton, we pick an endpoint 
and start summing the values of the length ratio for each skeleton point until we 
reach a junction. This sum $\sum_i \nabla \cdot F(x_i)$ over every pixel $x_i$ of our extracted 
skeleton branch is an approximation of $\int_s \nabla \cdot F \, ds = \int_s (dl_1/ds + dl_2/ds) \, ds = 
\ell(l_1) + \ell(l_2)$, the length of the border that generates the skeleton branch. 

At this point we have identified a branch and we have calculated the 
total value of the length-ratio along that branch, or, in other words, we have 
computed the total length of the border that generated the branch. We continue 
this process until we have spanned each branch in the entire skeleton. Thus 
we obtain a weighted graph representation of the skeleton. In the case of a 
simple shape, i.e. a shape with no holes, the graph has no cycles and thus is an 
(unrooted) tree. 

Given this representation we can cast the problem of computing distances 
between different shapes as that of finding the tree edit distance between the 
weighted graphs for their skeletons. 

Tree edit distance is a generalization to trees of string edit distance. The edit 
distance is based on the existence of a set $S$ of basic edit operations on a tree and 
a set $C$ of costs, where $c_s \in C$ is the cost of performing the edit operation $s \in S$. 
The choice of the basic edit operations, as well as their cost, can be tailored to 
the problem, but common operations include leaf pruning, path merging, and, 
in case of an attributed tree, change of attribute. Given two trees $T_1$ and $T_2$, the 
set $S$ of basic edit operations, and the costs of such operations $C = \{c_s, s \in S\}$, we 
call an edit path from $T_1$ to $T_2$ a sequence $s_1, \dots, s_n$ of basic edit operations that 
transforms $T_1$ into $T_2$. The length of such a path is $l = c_{s_1} + \dots + c_{s_n}$; the minimum 




Fig. 3. The flux through the border and through $\epsilon$ are equal 
length edit path from $T_1$ to $T_2$ is the path from $T_1$ to $T_2$ with minimum length. 
The length of the minimum length path is the tree edit distance. 

With our measure assigned to each edge of the tree, we define the cost of 
matching two edges as the difference of the total length ratio measure along the 
branches. The cost of eliminating an edge is equivalent to the cost of matching 
it to an edge with zero weight, i.e. one along which the total length ratio is zero. 



5 Experimental Results 





Fig. 4. A disappearing pro- 
trusion which causes instability 
in shock-length, but not in our 
measure 



In this section we assess the ability of the proposed measure to discriminate 
between different shapes that give rise to skeletons with the same topology. We 
will also assess how smoothly the overall measure goes through transitions. 

As demonstrated earlier in the paper, we 
know that the length ratio measure should be 
stable to any local shape deformation, including 
those that exhibit an instability in shock length. 

This kind of behaviour at local deformations is 
what has led to the idea that the skeleton is an 
unstable representation of shape. 

To demonstrate the stability of the skeletal 
representation when augmented with the length 
ratio measurement, we have generated a sequence 
of images of a rectangle with a protrusion 
on one side (Figure 4). The size of the protrusion 
is gradually reduced throughout the sequence, 
until it is completely eliminated in the final image. 
In Figure 5 we plot the global value of the length ratio measure for the shock 
branch generated by the protrusion. It is clear that the value of the length ratio 
measure decreases monotonically and quite smoothly until it becomes zero when 
the protrusion disappears. 

In a second set of experiments we aimed to assess the ability of the length
ratio measure to distinguish between structurally similar shapes. To do this we
selected two shapes that were perceptually different, but which possessed
skeletons with a very similar topology. We then generated an image sequence in
which the two shapes were morphed into one another. Here the original shapes
are the start and end frames of the sequence. At each frame in the sequence we
calculated the distance from that frame to the start and end shapes.

We repeated this experiment with two morphing sequences. The first sequence
morphed a sand shark into a swordfish, while the second morphed a donkey into
a hare.




Fig. 5. The measure of the skeleton segment generated by a protrusion

Fig. 6. Morphing sequences and their corresponding skeletons: sand shark to
swordfish on the left, and donkey to hare on the right



To determine the distance between two shapes we used the Euclidean distance
between the normalized weights of matched edges. In other words, the distance is

$$D(A, B) = \sqrt{\sum_i \left(e_i^A - e_i^B\right)^2},$$

where $e_i^A$ and $e_i^B$ are the normalised weights on the corresponding edges
indexed by $i$ on the shapes denoted by $A$ and $B$. The normalised weights are
computed by dividing the raw weights by the sum of the weights of each tree.
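
The distance just defined can be sketched in a few lines. The sketch assumes that the edge correspondences are already known and are encoded by position (the i-th weight of one shape is matched to the i-th weight of the other); the function names are illustrative only.

```python
import math


def normalise(weights):
    """Divide each raw edge weight by the sum of the weights of the tree."""
    total = sum(weights)
    return [w / total for w in weights]


def shape_distance(weights_a, weights_b):
    """Euclidean distance between the normalised weights of matched edges.

    Edges are assumed to be matched by index (correspondences given)."""
    if len(weights_a) != len(weights_b):
        raise ValueError("matched edge lists must have the same length")
    e_a = normalise(weights_a)
    e_b = normalise(weights_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(e_a, e_b)))
```

Because of the normalisation, a shape and a uniformly scaled copy of it (all weights multiplied by the same factor) are at distance zero.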



We apply this normalized length ratio measure to ensure scale invariance: two
identical shapes scaled to different proportions would have different ratios
due to the scale difference, but the measure along equivalent branches of the
two shapes would vary by a constant scale factor: the ratio of the lengths of
the borders. Since the sum of the weights of the edges of a tree is equal to
the total length of the border, by dividing the weights in each branch by this
quantity we reduce the two measurements to the same scale. In this way the
relevant quantity is not the absolute magnitude for a branch, but the magnitude
ratio with other branches.

There is clearly an underlying correspondence 
problem involved in calculating the distance in this 
way. In other words, we need to know which edge 
matches with which. To fully perform a shape recog- 
nition task we should solve the correspondence 
problem. However, the aim of the work reported 
here was to analyze the properties of our length 
ratio measure and not to solve the full recognition 
problem. Thus for the experiments reported here we 
have located the edge correspondences by hand. 




For each morphing sequence, in Figure 7 we plot the distance between each frame
in the sequence and the start and end frames. The monotonicity of the distance
is evident throughout the sequences. This demonstrates the capacity of our
length ratio measure to disambiguate between shapes with topologically similar
skeletons.

Fig. 7. Distances from first and last frame of the morphing sequences:
(a) distances in fish morphing sequence; (b) distances in donkey to hare
morphing sequence



To further assess the ability to discriminate between similar shapes, we
selected a set of topologically similar shapes from a database of images of
tools. In the first column of the accompanying figure we show the selected
shapes. To their right are the remaining shapes sorted by increasing normalized
distance. Each shape is annotated by the value of the normalized distance.

It is clear that similar shapes are usually closest to one another. However,
there are problems due to a high sensitivity to occlusion. This can be seen in
the high relative importance given to the articulation angle: in the pliers
images, articulation occludes part of the nose of the pliers. While sensitivity
to occlusion is, without a doubt, a drawback of the measure, we have to take
into account that skeletal representations in general are highly sensitive to
occlusion.



6 Conclusions 

In this paper we presented a shape measure 
defined on the skeleton. This measure has 
been used in the literature as a branch rele- 
vance measure during skeleton extraction and 
pruning. We state that the informational con- 
tent of the measure goes beyond this use, and 
can be used to augment the purely structural information residing in a skeleton 
in order to perform shape indexation and matching tasks. 

We show that the shape measure has a number of interesting properties that
allow it to distinguish between structurally similar shapes. In particular, the
measure a) changes smoothly through topological transitions of the skeleton, b)
is able to distinguish between ligature and non-ligature points and to weight
them accordingly, and c) exhibits invariance under “bending”. What makes
the use of this measure particularly appealing is the fact that it can be calculated
with no added effort when the skeleton is computed using the Hamilton-Jacobi
method of Siddiqi, Tannenbaum and Zucker.
Abstract. 2D curve representations usually take algebraic forms in
ways not related to visual perception. This poses great difficulties in 
connecting curve representation with object recognition where informa- 
tion computed from raw images must be manipulated in a perceptually 
meaningful way and compared to the representation. In this paper we 
show that 2D curves can be represented compactly by imposing shap- 
ing constraints in curvature space, which can be readily computed di- 
rectly from input images. The inverse problem of reconstructing a 2D 
curve from the shaping constraints is solved by a method using curva- 
ture shaping, in which the 2D image space is used in conjunction with its 
curvature space to generate the curve dynamically. The solution allows 
curve length to be determined and used subsequently for curve model- 
ing using polynomial basis functions. Polynomial basis functions of high 
orders are shown to be necessary to incorporate perceptual information 
commonly available at the biological visual front-end. 



1 Introduction 

The first goal of visual perception is to make the structure of the contrast vari- 
ation in the image explicit. For stationary images, the structure is organized 
through the curvilinear image contours. From the point of view of information 
theory, the probability that an image contour is formed by some random dis- 
tribution of contrast is extremely small and thus is highly informative. For the 
contour itself, it is also more meaningful to identify the part of the image con- 
tour that is more informative than other parts of the same contour. Though 
this principle is important from either the view of information theory or data 
compression, it is nonetheless essential to inquire about the inverse problem, i.e., 
how can the less informative part be recovered from the more informative part? 
This paper is about both problems in the 2D curve domain with main emphasis 
on the inverse problem. 
Methods for representing 2D curves are usually segment-based with each segment
defined by either a straight line (polygon) or a parameterized curve (spline).
The segmentation points that separate segments are determined from properties
computed along the curve, among which curvature is the most commonly used.
However, the properties for curve segmentation are generally highly sensitive
to scale changes because the computation is commonly conducted after the
contours are identified, which is notoriously dependent on the scale used. This
problem can be avoided by using methods that compute the curvature along the
contour directly from the image and a carefully designed selection scheme for
the scales used in the computation.

Curvature has been considered to be one of the major perceptual properties of
2D shapes. It is invariant to rigid transformations and can be computed
by our physiological system. It has been used extensively in shape matching
and object recognition as well as for shape modeling in both 2D and 3D.

From these observations regarding a 2D curve and its perception, the prob- 
lem of 2D curve modeling can be formulated as a two-stage process: first, the 
perception-based selection of the local parts on the curve to be modeled, and 
second, the measurement of relevant modeling parameters regarding the shape. 
In this paper we also formulate the inverse problem of constructing a curve from 
a given set of parameters that has been selected previously as modeling param- 
eters. The combination of the significance of curvature in visual perception and 
the importance of geometrical modeling of image contours is motivation for de- 
veloping a new framework for 2D curve representation and reconstruction using 
curvature. 



2 Background 

2.1 Direct Curvature Computation from an Image 

Curvature computation on image contours is generally sensitive to noise due
to the way the computation is conducted, i.e., curvature is computed from the
detected contour. This problem can be remedied greatly by computing curvature
directly from an image at a tentative contour position. The method can be
extended to the computation of higher-order differential invariants such as the
derivative of curvature, which will be used extensively in this paper.

Let $I(x, y)$ be the input image. A 2D Gaussian kernel is separable and
defined by $\psi_{00}(x, y; \sigma) = \psi_0(x; \sigma)\,\psi_0(y; \sigma)$, with $\sigma$ being a scale
parameter and $\psi_0(x; \sigma) = (1/(\sqrt{2\pi}\,\sigma)) \exp(-x^2/2\sigma^2)$ a 1D Gaussian kernel. The
$i$th-order and $j$th-order differentiations of $\psi$ with respect to $x$ and $y$ are
given by $\psi_{ij}(x, y; \sigma) = \psi_i(x; \sigma)\,\psi_j(y; \sigma)$. It can be shown that the curvature at
location $(x_0, y_0)$ of an image contour is given by

$$\kappa(x_0, y_0) = \frac{\psi_{20}(x^\theta, y^\theta) * I(x, y)}{\psi_{01}(x^\theta, y^\theta) * I(x, y)} \qquad (1)$$
where $*$ is the convolution operator and
$(x^\theta, y^\theta) = (x\cos\theta + y\sin\theta,\; -x\sin\theta + y\cos\theta)$
with $\theta$ being the orientation of the contour. The derivative of curvature

is then given by dn/ds = kX — t v'') * I{x,y))/d>g, where <l>{x,y,9;a) = 

— {d-i/joi/dO) * I{x,y), X = — • n/d>g, and <Pg = d^jdd with n being the unit 
normal vector to the contour. An example of the curvature computed in this 
way is shown in Figure 1.




Fig. 1. The curvature along the contour of an airplane image. 



2.2 Geometry of 2D Curves 

Given a curve $c(s) = (x(s), y(s))$ parameterized by its curve length $s$, the
fundamental theorem of differential geometry for planar curves enables us to
describe the curve uniquely (up to a rotation $\theta_0$ and a translation $(a, b)$)
using its curvature $\kappa(s)$. This is explicitly formulated by the intrinsic
equations:

$$x(s) = \int \cos\theta(s)\,ds + a, \quad y(s) = \int \sin\theta(s)\,ds + b, \quad \theta(s) = \int \kappa(s)\,ds + \theta_0,$$

with three boundary conditions given to specify $(a, b)$ and $\theta_0$. The curve
$\kappa(s)$ is the curvature space for $c(s)$.
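
As an illustration of the intrinsic equations, the sketch below integrates them numerically for a given curvature function. It is not the reconstruction method of this paper, only a direct midpoint-rule integration; the function name, the step count `n`, and the choice of integration rule are assumptions made for the example.

```python
import math


def reconstruct_curve(kappa, length, n=1000, a=0.0, b=0.0, theta0=0.0):
    """Integrate the intrinsic equations with a midpoint rule.

    kappa  : function s -> curvature at arc length s
    length : total arc length to integrate over
    (a, b) : start point, theta0 : start orientation
    Returns the list of n + 1 points (x, y) along the curve."""
    ds = length / n
    x, y, theta = a, b, theta0
    points = [(x, y)]
    for i in range(n):
        k = kappa((i + 0.5) * ds)          # curvature at the step midpoint
        theta_mid = theta + 0.5 * k * ds   # orientation at the step midpoint
        x += math.cos(theta_mid) * ds
        y += math.sin(theta_mid) * ds
        theta += k * ds
        points.append((x, y))
    return points
```

For a constant curvature of 0.5 and a length of 4π, the result is a closed circle of radius 2 that starts and ends at the origin. With noisy curvature samples the integration errors accumulate along the curve, which is the instability referred to in the text.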

Hence the problem of shape representation in 2D image space is equivalent to
the representation of the function $\kappa(s)$ in curvature space. The primary difficulty
in using the intrinsic equations directly in curve reconstruction from curvature 
space is that there is no well-defined computational procedure for constraining 
the shape in either the 2D image space or the curvature space from the changing 
shape in the other space. In other words, when taking into account noise, both 
spaces are unstable by themselves. However, by incorporating both spaces into 
a reconstruction algorithm, satisfactory results can be achieved. 
We will subsequently consider the following geometrical parameters of a
curve. Given a smooth curve $c(s)$ that is continuous, two points $P_0, P_1$ on
$c(s)$ at $s_0, s_1$ are such that the respective curvatures $\kappa(s_0), \kappa(s_1)$ are
curvature extrema, i.e., $\kappa'(s_0) = \kappa'(s_1) = 0$. The points $P_0, P_1$ are called
feature points for $c(s)$ (Figure 2).
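
On sampled data, the condition that the derivative of curvature vanishes can be approximated by a sign change of the discrete derivative. The sketch below assumes the curvature is given as a list of samples at equal arc-length spacing; the function name is hypothetical and plateaus (runs of equal samples) are deliberately ignored.

```python
def curvature_extrema(kappa_samples):
    """Return the indices of strict local extrema of sampled curvature.

    An index i is reported when the discrete derivative changes sign
    between (i - 1, i) and (i, i + 1)."""
    indices = []
    for i in range(1, len(kappa_samples) - 1):
        d_prev = kappa_samples[i] - kappa_samples[i - 1]
        d_next = kappa_samples[i + 1] - kappa_samples[i]
        if d_prev * d_next < 0:
            indices.append(i)
    return indices
```

The returned indices are the candidate feature points; in practice the samples would first be smoothed at the chosen scale.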



Given this background, the problem of 2D shape representation using feature 
points will be to locate the curvature extrema along a curve and construct the 
curve using these extremal points. Traditionally this goal is achieved through 
piecewise interpolation using cubic splines and matching boundary conditions 
at the knots. However, this approach is unable to incorporate the higher-order
constraints of $\kappa$ and $\kappa'$. The other problem is the relatively straight segments
provided by this model, requiring more knots for more curved regions. This fact 
may not be favorable when the scale-space factor is taken into account, which 
requires a more or less even distribution of knot positions along the curve at 
a given scale. These problems can be alleviated by using higher order splines. 
Another problem is the extra feature points inserted by the basis functions. To 
solve this problem, a different approach that works on both image space and 
curvature space is required. 

The parameter used for the interpolation is also problematic, especially when 
curve length parameterization is required. For a given image, the curve length 
along an image contour can generally be estimated quite accurately, and the 
relevant geometrical and modeling parameters can be computed. However, the 
inverse problem of finding the curve from a given set of boundary conditions does 
not provide information on curve length. The method presented in the next sec- 
tion provides the curve length information as one of its results. This information 
can then be used for modeling the curve using the high-order polynomial basis
functions presented in Section 4.

3 Curve Representation by Curvature Space Shaping

Given two feature points, $P_0, P_1$, on an unknown curve $c(s)$ in an image and their
associated tangent orientations, $\theta_0, \theta_1$, as well as their signed curvatures, $\kappa_0, \kappa_1$,
we now present a method to solve the problem of finding the curve that satisfies
the given boundary conditions with the property that there is no computable
curvature extremum in between $P_0$ and $P_1$ other than those at $P_0$ and $P_1$.



Fig. 2. The geometrical factors that determine a 2D curve.
Fig. 3. Dynamical moves for a curve segment between two convex points. The
tangent line $g_{01}$ between $C_0$ and $C_1$ serves as an estimate for curve length.

Fig. 4. Dynamical moves for a curve segment between one convex and one concave
point.



Let the osculating circle at $P_0, P_1$ be $C_0, C_1$ respectively, and the tangent line
between $C_0, C_1$ in the direction of $t_0 = (\cos\theta_0, \sin\theta_0)$ be $g_{01}$ (the unit vector $g$
along $g_{01}$ at the $C_0$ end has the property $g \cdot t_0 > 0$). Among the four tangent lines,
$g_{01}$ is chosen to be one of the two non-crossing ones closest to $P_0$ if $\kappa_0 \kappa_1 > 0$
and one of the two crossing ones if $\kappa_0 \kappa_1 < 0$. The curve $c(s)$ is constructed
by dynamically moving stepwise from $P_0$ along a direction that is gradually
changed into the direction of $t_1 = (\cos\theta_1, \sin\theta_1)$ while gradually changing the
curvature of the corresponding osculating circle in the process (Figures 3 and 4).
Since $c(s)$ is unknown, the curve length between $P_0$ and $P_1$ cannot be determined
in advance. Rather, we use the length of the tangent line $g_{01}$ as an initial estimate
for the curve length.

Let the desired number of steps for reaching $P_1$ from $P_0$ be $n$, and let the
length of the tangent line $g_i$ at each step be $s_i$ for $i = 1, \ldots, n$. The direction
of movement at each step $i$ is determined by the corresponding $\theta_i$ and $g_i$ by
$(t_i + g_i)/2$, i.e., move half way in between $t_i$ and $g_i$. The curvature $\kappa_i$ of the
osculating circle $C_i$ is given by $\kappa_0 + (\kappa_1 - \kappa_0)/n$. The distance $d_i$ to be moved
consists of a movement along $g_i$ followed by a movement along $C_i$ and is given by
$d_i = s_i/(n - i + 1)$. The orientation change caused by this movement is then given
by $\Delta\theta = \kappa_i d_i = (\kappa_1 - \kappa_0) d_i/n$. This is formulated in such a way that if the
estimated curve length $s_i$ is indeed the curve length, then at each step we will
move precisely $1/(n - i + 1)$ of the curve length and will complete our journey in
$n$ steps. The curve length is thus given by

$$s = \sum_{i=1}^{n} \frac{s_i}{n - i + 1}.$$

The corresponding curvature spaces of the curves in Figures 3 and 4 are given
in Figures 5 and 6 respectively. Under the solvability conditions explained in the
next section, it can be shown that the movement will approach $P_1$ in $n$ steps with
the desired boundary conditions, and the following limits can be established:

$$\lim_{n\to\infty} P_n = P_1, \quad \lim_{n\to\infty} \theta_n = \theta_1, \quad \lim_{n\to\infty} \kappa_n = \kappa_1.$$

Fig. 5. The curvature space for a curve segment with two convex points.

Fig. 6. The curvature space for a curve segment with one convex and one concave
point.
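
The step-length rule can be checked in isolation. The sketch below simulates only the bookkeeping of the distances moved, not the geometric moves themselves, under the assumption that the estimate at step i equals the exact length still to be travelled; the function name is hypothetical.

```python
def simulate_steps(total_length, n):
    """Apply the rule d_i = s_i / (n - i + 1) for i = 1..n.

    Assumption: the estimate s_i is the exact remaining curve length.
    Returns the list of step lengths and the length left after n steps."""
    remaining = total_length
    moves = []
    for i in range(1, n + 1):
        d_i = remaining / (n - i + 1)
        moves.append(d_i)
        remaining -= d_i
    return moves, remaining
```

Under this assumption every step has the same length, `total_length / n`, and nothing remains after the last step, which is the sense in which the journey is completed in n steps. When the estimates are only approximate, the steps differ and their sum gives the computed curve length.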

Each of the $n$ segments of curve $c(s)$, according to the curvature space, is
a partial arc on the osculating circle $C_i$ with constant curvature (Figure 7),
which can be approximated by a piecewise straight segment. Even though the
curves have great similarity, their curvature spaces have completely different
shapes. This illustrates the difficulty in working from only one of the spaces. In
comparison, a constant curvature segment can better track the tangent line and
converge faster to the destination than the straight segment counterpart because
the osculating circle at each point bends toward rather than away from the line.
This implies that fewer steps are required and better precision is obtained.



3.1 Solvability Conditions 

There are two conditions governing whether this problem has a solution. One
corresponds to the “sidedness” of the object, since the sign of curvature is
defined according to which side of the object the normal vector lies by the Frenet
equation $t' = \kappa n$. The tangent line $g_i$ gives an estimate of the curve length of
$c(s)$ and at the same time defines on which side the object lies. The problem is
unsolvable when the boundary conditions create an impossible object. The other
condition of solvability is whether during the process the curvature enters an
area in which extra extrema have to be created. These areas are denoted
unsolvable in Figures 5 and 6. These two unsolvability conditions are illustrated
in Figure 8.

Fig. 7. The two segment models of a curve: a curve with piecewise constant
curvature and a curve with straight line segments.

Fig. 8. The two unsolvability conditions between two convex points.



4 Curve Representation by Polynomial Basis 

The curve length of an arbitrarily parameterized curve $c(t)$ is $s = \int |c'(t)|\,dt$.
From this formulation it is clear that if $c(t)$ is represented by polynomial basis
functions, its curve length will not be polynomial, and vice versa. Hence, to use
a polynomial basis we can either work in the image space of $c(t) = (x(t), y(t))$ by
fitting the boundary conditions $c_0, c_1, \theta_0, \theta_1, \kappa_0, \kappa_1, \kappa'_0, \kappa'_1$, or work in curvature
space through the intrinsic equations on the same set of boundary conditions.
Both methods result in a set of highly nonlinear equations for which the
existence of a solution is questionable. In this section we introduce a compromise
method using polynomial basis functions in image space that satisfy all the
boundary conditions but do not guarantee that new curvature extrema will not be
inserted in the models.

4.1 Curves from Hermite Splines

The Hermite polynomials $H_{i,j}(t)$ of order $L$ with $i = 0, \ldots, L$ and $j = 0, 1$
satisfy the cardinal property: $H_{i,0}^{(k)}(0) = \delta_{ik}$, $H_{i,0}^{(k)}(1) = 0$, $H_{i,1}^{(k)}(0) = 0$,
$H_{i,1}^{(k)}(1) = \delta_{ik}$, where the superscript $k$ indicates the order of differentiation.
This set of equations defines polynomials of order 3 for $L = 1$, of order 5 for
$L = 2$, and of order 7 for $L = 3$.

The cardinal property is useful for fitting boundary conditions of various
differential orders. We will consider here the case of first, second and third order
differentials. Let $P_j^i$ be the $i$th derivative of the curve at $P_j = c(t_j)$. The curve
segment connecting the two points ($t = t_0$, $t = t_1$) using Hermite splines is

$$c(t) = \sum_{j=0}^{1} \sum_{i=0}^{L} P_j^i\, H_{i,j}(t), \qquad L = 1, \ldots, 3.$$
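
For the lowest-order case, L = 1, the basis consists of the four standard cubic Hermite polynomials on the interval [0, 1]. The sketch below stores them as coefficient lists, which makes the cardinal property easy to check numerically, and evaluates one curve segment. The dictionary `H`, the helper names, and the normalisation of the parameter to [0, 1] are choices made for this example.

```python
# Coefficients (constant term first) of the cubic Hermite basis on [0, 1].
# Key (i, j): i = derivative order matched, j = end point (0 or 1).
H = {
    (0, 0): [1, 0, -3, 2],    # 2t^3 - 3t^2 + 1
    (1, 0): [0, 1, -2, 1],    # t^3 - 2t^2 + t
    (0, 1): [0, 0, 3, -2],    # -2t^3 + 3t^2
    (1, 1): [0, 0, -1, 1],    # t^3 - t^2
}


def poly_eval(coeffs, t, order=0):
    """Evaluate the order-th derivative of a polynomial at t."""
    c = list(coeffs)
    for _ in range(order):
        c = [k * c[k] for k in range(1, len(c))]
    return sum(ck * t ** k for k, ck in enumerate(c))


def hermite_segment(p0, d0, p1, d1, t):
    """Point at parameter t on the cubic segment with end points p0, p1
    and end derivatives d0, d1 (all given as 2D tuples)."""
    return tuple(
        p0[a] * poly_eval(H[(0, 0)], t) + d0[a] * poly_eval(H[(1, 0)], t)
        + p1[a] * poly_eval(H[(0, 1)], t) + d1[a] * poly_eval(H[(1, 1)], t)
        for a in range(2)
    )
```

Each basis function takes the value 1 for exactly one combination of derivative order and end point and 0 for the others, so each boundary value enters the segment independently. The quintic and order-7 bases follow the same pattern with more conditions per end point.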



This formulation is given in terms of the differentials at two boundary points,
which are not readily available since the problem is given the conditions of
location $(P_0, P_1)$, orientation $(\theta_0, \theta_1)$, curvature $(\kappa_0, \kappa_1)$, and differential of
curvature $(\kappa'_0, \kappa'_1)$. Since the estimation of $P_j^i$ is generally noisy, it is necessary to
use curve length as a parameter. Hence, the curvature is $\kappa = x'y'' - x''y'$ and
$\kappa' = x'y''' - x'''y'$ since $(x')^2 + (y')^2 = 1$. Given $\kappa$ and $\kappa'$, we have the freedom
to choose two additional conditions to fix $x'', y''$ and $x''', y'''$. This can be done
arbitrarily since the problem itself does not dictate these conditions.




Fig. 9. The image contour of the airplane in Figure 1.

Fig. 10. The curvature space and extrema for the airplane contour.



5 Examples 

For the airplane contour in Figure 9 the curvature space is given in Figure 10,
in which the curvature extrema are identified. The corresponding feature points
are also marked in Figure 9. These points are the component partition points
in which the concave points separate components while the convex points mark
the partition of different segments within the same component. One of these
components with feature points within the segment identified is shown in
Figure 11. The curve length is estimated from the curve computed using the method
of curvature shaping, and subsequently used in the representation by Hermite
basis functions. Three different orders of the basis functions were used. Bases
of order 3 used the location and tangent orientation information only. Bases of
order 5 also matched boundary conditions for the curvature, while Hermite bases
of order 7 satisfied the additional condition that these points are actually feature
points with extremal curvature.




Fig. 11. Parts of the airplane represented by Hermite bases of order 3, 5, and 7,
compared to the original.



6 Discussion 

6.1 Scale Space 

Scale space manifests its effect mostly in computation. From the formulation
in Section 2.1 it can be observed that after the image contours are computed,
the effect of the 2D scale-space kernel parameterized by rectangular Cartesian
coordinates is equivalent to the effect of a 1D scale-space kernel parameterized
by curve length. This is because the image contour is computed by orienting the
kernel $\psi_{01}$ in the direction of the contour. This essentially creates a
“curvature scale space” [2], in which variations within a fixed scale are gradually
lost as the scale gets coarser. This results in a separation of feature points on
the curve with a distance proportional to the scale used for the computation.
Hence, even though the curvature $\kappa(s)$ is a highly nonlinear function of $x(s)$
and $y(s)$ and its shape cannot be exactly predicted for a given scale, the feature
points with extremal curvature can nonetheless be located by searching the
curvature space of finest scale and partitioning the curve with segments of length
proportional to the scale without actually computing the curvature scale space
for coarser scales.

6.2 Perceptual Boundary Conditions 

The measure of distance in biological visual perception is provided by compar- 
ison between a reference length and what is to be measured, i.e., there is no 
intrinsic metric. This renders computations using distance measure (such as op- 
tical flow and curvature) imprecise. On the other hand, an orientation measure 
has built-in mechanisms with a certain degree of precision. From these observa- 
tions, the primary measurement of the local shape of a curve will be the position 
and orientation relative to a 2D Cartesian coordinate system, while curve length 
and curvature are much less precise in terms of measurement. Hence the primary 
boundary conditions related to perception are locations and orientations. How- 
ever, we do show that when secondary boundary conditions such as curvature 
and its derivative are available, the representation is much more compact and 
precise. For example, by using cubic or quintic Hermite bases for the component
in Figure 11, precision can be augmented by adding more knot points. However,
it is not clear which ones to choose since there is no special attribute in curvature 
space to facilitate this choice. 

6.3 Component Partitions Using Curvature Space 

There are two different kinds of information presented through the curvature
space when the whole contour is considered. The prominent ones are actually
component partition points (negative extrema) of the shape or segment
partition points (positive extrema) within a component. This can be seen in
Figures 9 and 10. Feature points for each segment have to be identified within the
segment rather than compared to every segment in the contour, and their extremal
property is essential to the modeling. The inadequacy of lower-order Hermite bases
to represent a curve segment is clearly seen in Figure 11 since these do not take
into account the extremal property at these points.

7 Conclusions 

A compact description of a smooth curve was presented in this paper, based on
curve features that are perceptually important. The geometrical model of these
features was defined by location, orientation and curvature, with the additional
property that the curvature reaches extremal values at feature points. Being
able to describe an object contour compactly is of great importance in object
recognition, especially when the description is independent of the viewpoint. We
also developed a method to identify the curve from a given set of perception-based
boundary conditions at prescribed feature points. One of the results is a good
estimation of curve length that can subsequently be used by polynomial basis
functions for curve modeling. However, to satisfy the given boundary conditions,
much higher-order polynomials are needed than those commonly used.
Abstract. In this paper we show that all images are topologically
equivalent. Nevertheless, one can define useful pseudo-topological 
properties that are related to what is usually referred to as topological 
perception. The computation of such properties involves low-level 
structures, which correspond to end-stopped and dot-responsive visual 
neurons. Our results contradict the common belief that the ability for 
perceiving topological properties must involve higher-order, cognitive 
processes. 

Keywords: Topology, Euler number, closure, curvature, visual percep- 
tion 



1 Introduction 

A fundamental issue of image analysis and understanding is the mathematical
description, or representation, of the input data. The basic representation of
an image, and indeed the closest to the physical process of vision, is that of a
graph of a function $I : (x, y) \to I(x, y)$ from a subset $U$ of $\mathbb{R}^2$ to $\mathbb{R}$; here $U$
is the retinal plane and $I(x, y)$ is the light intensity at the point $(x, y) \in U$.
From a geometrical point of view, the graph of $I$ is a Monge patch, that is a
surface $L$ of the form $L = \{x, y, I(x, y)\}$. Alternatively, $L$ can be considered
as the image of the mapping $\phi : (x, y) \to (x, y, I(x, y))$. Then, the process of
image formation can be seen as a mapping from the surface $M$ of a physical
object to the Monge patch $L$ representing the image. Let $S$ be the surface of an
object in $\mathbb{R}^3$. Consider a coordinate system $\{x, y, z\}$, whose origin coincides with
the position of the viewer and let $\{x, y\}$ be the image plane. The visible part
$M$ of $S$ can be given a Monge-patch representation $M = (x, y, f(x, y))$, where
$z = f(x, y)$ is the distance of the point $r = (x, y)$ on the image plane of the
observer to $s = (x, y, z)$ in $M$. Note that, if orthographic projection is assumed,
$f$ and $I$ share the same domain $U$ and there is a one-to-one correspondence
between $p = (x_0, y_0, f(x_0, y_0)) \in M$ and $q = (x_0, y_0, I(x_0, y_0)) \in L$. The

extension to the case of perspective projection is immediate. If a surface M is 
transformed by a continuous transformation into a surface M', a new image L' 
will be generated by M'; the relation between the transformation of the image 
and the transformation of the underlying object has been made precise in 

The representation of images as surfaces has two advantages: first, it is close
to the original data structure and does not require any high level process to be
generated; second, it allows one to use the technical machinery of geometry to
investigate it. In particular the condition of differentiability allows one to make
use of a very powerful theorem of differential geometry, the Gauss-Bonnet theorem,
which provides a link between global and local properties of surfaces. Surfaces
can be given a global classification based on the notion of topological invariants.
A property of a surface is called a topological invariant if it is invariant under
homeomorphism, that is under a one-to-one continuous transformation that
has a continuous inverse. Two surfaces are said to be topologically equivalent if
they can be transformed one into the other by a homeomorphism. For instance, the
number of holes in a surface is a topological invariant. In particular, here we want
to investigate how the topological properties of object surfaces are reflected in
the images. It will be shown in the next section that all images are topologically
equivalent, i.e. any two images can be transformed one into the other by means
of a homeomorphism. From this result it follows that the topological properties
of an object’s surface are not intrinsic properties of its image.

We shall show how topological properties of the object underlying the images
can be found by using 2D-operators, i.e. operators whose output is different
from zero only in case of 2D-features such as corners, line-ends, curved lines and
edges. These operators can be associated to the activity of nonlinear end-stopped
neurons. Mechanisms possibly underlying the activity of such neurons have been
investigated by a few authors. A general theory for end-stopping
and 2D-operators, however, is still missing, but attempts to identify the basic
ingredients for such a theory have been made.

Finally, we will show how 2D-operators can be used to provide an alternative
explanation for some experimental findings, which have suggested
that the human visual system might be quite sensitive to global, “topological”
characteristics of images.

2 The Gauss-Bonnet Theorem 

The Gauss-Bonnet theorem is one of the most important theorems in differential 
geometry, in that it provides a remarkable relation between the topology of a 
surface and the integral of its Gaussian curvature. 

Let $R$ be a compact region (e.g. a Monge patch) whose boundary $\partial R$ is the
finite union of simple, closed, piecewise regular curves $C_i$. Consider a polygonal
decomposition of $R$, that is a collection of polygonal patches that cover $R$ in a
way such that if any two overlap, they do so in either a single common vertex
or a single common edge.
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Thus a polygonal decomposition T> carries with it not only the polygonal 
patches, called faces, but also the vertices and edges of these patches. Suppose 
V has / faces, e edges and v vertices, then y = / — e + uis the Euler-Poincare’ 
characteristic of R and it is the same for all polygonal decompositions of R m- 
The Euler-Poincare characteristic can be extended, in a natural way, to regular, 
compact surfaces. In Fig. 0some examples of closed surfaces with different val- 
ues of X Etre shown. Two compact surfaces are topologically equivalent if and 






Fig. 1. Surfaces with different Euler-Poincare’ characteristics (y = 2, 0, —2 from left 
to right). 

only if they have the same Euler-Poincare’ characteristic x USE). Here, we are 
interested in Monge patches, that represent both the visible part of surfaces and 
their images. If the Monge patch M is a simple region, that is is homeomorphic 
to an hemisphere, x = 1 |0| , if it has an hole x = Oj and in general 

X = {l-riholes)- ( 1 ) 

Note that if, instead of regions, we consider regular surfaces, equation ^ 
becomes x = 2(1 — Uhoies)- The definition of x holds for connected surfaces; 
we shall now extend the definition of x to the case of not connected surfaces. 
Suppose there are n object in the scene, the global Euler-Poincare’ characteristic 
XT is then simply xt = E” Xj- 
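The counting rules above can be checked on small hand-made decompositions. The following is a minimal sketch, not part of the original paper; the function names and the example decompositions (a cube surface, a single square, a square annulus cut into four trapezoids) are illustrative assumptions.

```python
def euler_from_decomposition(faces, edges, vertices):
    """chi = f - e + v for a polygonal decomposition."""
    return faces - edges + vertices


def chi_region(n_holes):
    """Eq. (1): connected region (Monge patch) with n_holes holes."""
    return 1 - n_holes


def chi_closed_surface(n_holes):
    """Closed regular surface with n_holes holes (handles)."""
    return 2 * (1 - n_holes)


def chi_total(chis):
    """Global characteristic of a scene: sum over its connected parts."""
    return sum(chis)


# Cube surface (sphere-like): 6 faces, 12 edges, 8 vertices.
cube = euler_from_decomposition(6, 12, 8)
# A single square patch (disk-like): 1 face, 4 edges, 4 vertices.
square = euler_from_decomposition(1, 4, 4)
# Square annulus cut into 4 trapezoids: 4 faces, 12 edges, 8 vertices.
annulus = euler_from_decomposition(4, 12, 8)

print(cube, square, annulus)
print(chi_total([chi_region(0), chi_region(1), chi_region(0)]))
```

The three decompositions reproduce the values given by the hole-counting formulas: the cube matches `chi_closed_surface(0)`, the square matches `chi_region(0)`, and the annulus matches `chi_region(1)`.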

Consider a region $R$. The Euler-Poincaré characteristic is related to the
curvature of $R$ by the celebrated Gauss-Bonnet formula

$$\iint_R K\,dA + \sum_i \int_{C_i} k_g\,ds + \sum_i \theta_i = 2\pi\chi, \tag{2}$$

where $dA$ is the infinitesimal area element, $C_i$ are the regular curves forming the
boundary $\partial R$ of $R$, $k_g$ is the geodesic curvature computed along the curve $C_i$,
and $\theta_i$ are the external angles at the vertices of $\partial R$.

If $S$ is a compact orientable surface, then $\iint_S K\,dA = 2\pi\chi(S)$.

This is a striking result: how is it possible that, by integrating a local
property of a surface, we obtain a global topological invariant of that surface?
Let us consider the surface of a sphere $S^2$ in $\mathbb{R}^3$. For a given radius $r$, the
Gaussian curvature is $K = 1/r^2$ and $\iint_{S^2} K\,dA = 4\pi$. Note that this result does
not depend on $r$, since as $r$ increases $K$ decreases but the area of the surface
increases. More importantly, the result is the same for all surfaces topologically
equivalent to a sphere. Suppose $S^2$ is transformed into a surface $S$ by deforming
the sphere with a dent. In this case the area increases, and the elliptic part
of the dent gives additional positive values. At the same time, however, new
negative values of $K$ are produced at the hyperbolic regions of the dent, and,
as a consequence, positive and negative curvatures cancel out and the total
curvature remains constant.
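The invariance of the total curvature under deformation can be illustrated with a discrete analogue. The sketch below is an assumption-laden stand-in, not the smooth computation discussed in the text: it uses a triangulated closed polyhedron, where the curvature concentrates at the vertices as angle defects (Descartes' theorem), and checks that the defects sum to $2\pi\chi$ before and after a vertex is pushed inwards. The mesh and function names are illustrative.

```python
import math


def corner_angle(p, q, r):
    """Interior angle at corner p of the triangle (p, q, r)."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def total_angle_defect(vertices, triangles):
    """Sum over all vertices of (2*pi - sum of incident corner angles)."""
    angle_sum = [0.0] * len(vertices)
    for a, b, c in triangles:
        pa, pb, pc = vertices[a], vertices[b], vertices[c]
        angle_sum[a] += corner_angle(pa, pb, pc)
        angle_sum[b] += corner_angle(pb, pc, pa)
        angle_sum[c] += corner_angle(pc, pa, pb)
    return sum(2.0 * math.pi - s for s in angle_sum)


def euler_characteristic(vertices, triangles):
    """chi = v - e + f of the triangulation."""
    edges = {frozenset(e) for a, b, c in triangles
             for e in ((a, b), (b, c), (c, a))}
    return len(vertices) - len(edges) + len(triangles)


# Octahedron: a sphere-like closed surface.
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
triangles = [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
             (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]

# Same connectivity, but the top vertex is pushed below the equator plane.
dented = list(octahedron)
dented[4] = (0, 0, -0.5)

for shape in (octahedron, dented):
    chi = euler_characteristic(shape, triangles)
    print(chi, total_angle_defect(shape, triangles) / (2.0 * math.pi))
```

The individual vertex defects change when the vertex is moved, but their sum stays at $4\pi$, in line with the argument given for the dented sphere.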



3 All Images Are Topologically Equivalent 

In this section we give a formal proof that all images are topologically equivalent.
Topological equivalence of two images means that, given any pair of images
with representations $L = (x, y, I(x,y))$ and $N = (x, y, J(x,y))$, there exists a
homeomorphism taking one into the other. First we prove the following

Lemma 1 Let $I$ be a function from $U$ to $\mathbb{R}$ and let $\phi : (x,y) \mapsto (x,y,I(x,y))$
from $U$ to $\mathbb{R}^3$. If $I$ is continuous then $\phi$ is a homeomorphism onto its image.

Proof. The map $\phi$ is one-to-one and can be written as $\phi = i \times I$, where $\times$ is the
Cartesian product and $i$ is the identity map of $U$ onto itself, which is obviously
continuous. Since $I$ is continuous by hypothesis, $\phi$ is continuous. The
inverse of $\phi$ is nothing else than the orthographic projection from $M = \phi(U)$ to $U$,
which is continuous, and the assertion follows.

From the lemma it follows that 

Proposition 1 Let $L$ and $N$ be images defined as the graphs of continuous
functions $I$ and $J$ respectively; then there exists a homeomorphism $h : L \to N$.

Proof. Consider the maps $\phi : (x,y) \mapsto (x,y,I(x,y))$ from $U$ to $L$ and
$\psi : (x,y) \mapsto (x,y,J(x,y))$ from $U$ to $N$, and define $h = \psi \circ \phi^{-1}$, which is a map
from $L$ to $N$. The map $h$ is a homeomorphism, being the composition of the two
homeomorphisms $\phi^{-1}$ and $\psi$.
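The construction in the proof is easy to make concrete. The following minimal sketch uses two arbitrarily chosen continuous intensity functions (both are illustrative assumptions) and builds the map from one image surface to the other by projecting down to the common domain and lifting again; the helper names are hypothetical.

```python
import math


def graph_map(intensity):
    """Return phi: (x, y) -> (x, y, intensity(x, y))."""
    def phi(x, y):
        return (x, y, intensity(x, y))
    return phi


def project(point):
    """Inverse of a graph map: orthographic projection onto the (x, y) plane."""
    x, y, _ = point
    return (x, y)


def transfer(target_map):
    """h = target_map o projection, a map from one graph onto another."""
    def h(point):
        return target_map(*project(point))
    return h


def I(x, y):
    return math.exp(-(x * x + y * y))      # a smooth blob


def J(x, y):
    return math.sin(3 * x) * math.cos(2 * y)  # a very different image


phi, psi = graph_map(I), graph_map(J)
h = transfer(psi)       # from the surface L to the surface N
h_inv = transfer(phi)   # from N back to L

p = phi(0.3, -0.7)      # a point on L
q = h(p)                # the corresponding point on N
print(p, q, h_inv(q) == p)
```

Every point of one surface is paired with exactly one point of the other, namely the one above the same $(x, y)$ position, whatever the two intensity functions look like.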

If $I$ is assumed to be smooth, that is, to have continuous partial derivatives
of any order in an open set $U$ in $\mathbb{R}^2$, then the surface is regular, and then it is
easy to prove that all surfaces which are the graphs of differentiable functions
are diffeomorphically equivalent; that is to say that for any pair of surfaces there
exists a bijective map, smooth with its inverse, taking one surface onto the other.

From Proposition 1 it follows that all images must have the same Euler-Poincaré
characteristic $\chi$, which can now be computed by making use of the
Gauss-Bonnet theorem.

The Gaussian curvature of a Monge patch $L = \{x, y, I(x,y)\}$ is given by

$$K = \frac{|H_I|}{\left(1 + (\partial I/\partial x)^2 + (\partial I/\partial y)^2\right)^2},$$

where $|H_I|$ is the determinant of the Hessian matrix $H_I$ of the function $I$.

Let now $D \subset \mathbb{R}^2$ be a disk of radius $r$ with boundary $C$. $D$ is a flat surface,
hence $\iint_D K\,dA = 0$, and $\sum_i \theta_i = 0$ because there are no vertices. In this case the
geodesic curvature along $C$ is equal to the curvature $k = 1/r$ of $C$, and it follows
that $\int_C k_g\,ds = 2\pi$. Therefore $\chi = 1$ and, since $\chi$ is a topological invariant, it
must be the same for all images.

It must be pointed out that the result applies only to Monge patches, and
hence not to any regular surface in $\mathbb{R}^3$; however, we are interested in images,
which indeed are graphs of functions, and hence Monge patches.
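A numerical check may help to see what $\chi = 1$ implies for a concrete image. The sketch below is illustrative only: it assumes a Gaussian blob $I(x,y) = \exp(-(x^2+y^2)/2)$ on a practically flat surround, uses the Monge-patch curvature formula with analytic derivatives, and integrates $K\,dA$ with a midpoint rule. Since a flat boundary circle already contributes $2\pi$ to the Gauss-Bonnet sum, the integrated Gaussian curvature is expected to vanish.

```python
import math


def curvature_density(x, y):
    """K times the area element per unit dx dy, for I = exp(-(x^2 + y^2) / 2)."""
    i = math.exp(-(x * x + y * y) / 2.0)
    ix, iy = -x * i, -y * i
    ixx, iyy, ixy = (x * x - 1.0) * i, (y * y - 1.0) * i, x * y * i
    g = 1.0 + ix * ix + iy * iy              # 1 + |grad I|^2
    k = (ixx * iyy - ixy * ixy) / (g * g)    # Gaussian curvature of the patch
    return k * math.sqrt(g)                  # dA = sqrt(g) dx dy


def integrate_curvature(half_width=6.0, n=240):
    """Midpoint-rule integral of K dA; also returns the elliptic (K > 0) part."""
    h = 2.0 * half_width / n
    total = positive = 0.0
    for a in range(n):
        x = -half_width + (a + 0.5) * h
        for b in range(n):
            y = -half_width + (b + 0.5) * h
            d = curvature_density(x, y) * h * h
            total += d
            if d > 0.0:
                positive += d
    return total, positive


total, positive = integrate_curvature()
# The elliptic cap contributes roughly 0.91, the hyperbolic skirt cancels it,
# and the total is close to zero.
print(total, positive)
```

The positive contribution of the central cap is exactly compensated by the negative curvature of the surrounding skirt, which is the same cancellation described for the dented sphere.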

In our proof we assumed image intensity $I$ to be a continuous function.
This is a common assumption justified by the observation that most images are 
band limited due to the imaging system. In human vision there is a good match 
between the band limitation and the sampling density. 

We have seen here that all images are topologically equivalent. Of course this
is not true for the objects that generate different images. The surfaces of these
objects may have different topological properties, e.g. for a sphere and a torus
$\chi = 2$ and $\chi = 0$ respectively (see Fig. 1), and their visible parts have $\chi = 1$
and $\chi = 0$ respectively (see Eq. (1)); however, their images have $\chi = 1$. Thus the
topological properties of an object’s surface cannot be determined as topological
properties of the corresponding image surfaces. In other words, characteristics
such as holes or discontinuities do not exist in the images per se; indeed, there is
no a priori reason to interpret dark or light blobs as holes. Nevertheless, one can
do so, and successful methods for estimating the Euler characteristic of binary
images have been presented [10, 13].
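Once blobs are interpreted as objects and holes, the Euler number of a binary image can be obtained by plain counting. The sketch below is a generic cell-complex count, given as an illustration rather than as the procedure of the cited works: every foreground pixel is taken as a closed unit square, and $\chi = v - e + f$ is evaluated over the distinct corners, sides and squares. With closed squares, pixels touching at a corner count as connected (8-connectivity).

```python
def euler_number(image):
    """Euler number of a binary image, foreground pixels as closed squares."""
    vertices, edges, faces = set(), set(), 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if not value:
                continue
            faces += 1
            corners = [(r, c), (r, c + 1), (r + 1, c + 1), (r + 1, c)]
            vertices.update(corners)
            for k in range(4):
                edges.add(frozenset((corners[k], corners[(k + 1) % 4])))
    return len(vertices) - len(edges) + faces


disk = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]

ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]

two_blobs = [[1, 0, 0],
             [0, 0, 0],
             [0, 0, 1]]

print(euler_number(disk), euler_number(ring), euler_number(two_blobs))
```

The three patterns give the values expected from Eq. (1) and from summing over separate objects: one for the solid blob, zero for the ring, two for the pair of blobs.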

4 Pseudo-Topological Properties of Images

It has been mentioned before that there is some experimental evidence suggesting
that the human visual system can discriminate on the basis of what appear
to be different topological properties of the objects underlying the image. It is
therefore of interest to search for image properties that reflect topological
properties of the associated objects. We call such properties pseudo-topological
properties of images, and we shall investigate which kind of operators are
appropriate to detect them. If pseudo-topological properties of images are to be
computed by integrating local features, i.e. the outputs of some operator, then
only 2D-operators, which capture the local curvature of the image, seem
appropriate, even though not all global 2D-operators will work in detecting
pseudo-topological properties of images. Indeed, let $L$ be the geometrical
representation of an image and suppose that its boundary $C$ is a regular closed curve
contained in a planar region such that, on $C$, $|\nabla I| = 0$. Then we have, as seen
before,

$$\int_C k_g\,ds = 2\pi. \tag{3}$$

From the Gauss-Bonnet formula, $\iint_L K\,dA + 2\pi = 2\pi\chi$, and, since $\chi = 1$,

$$\iint_L K\,dA = 0. \tag{4}$$

If we now extend an image by a planar frame we can find a
curve $C$ such that Eq. (3) holds. Then, for all “framed” images the total
curvature is equal to zero. For any practical purpose this also implies that any
deviation from zero will be a boundary effect. Therefore, a straightforward
application of the Gauss-Bonnet theorem to image analysis cannot lead to useful
image properties. We will make no attempt to develop a general theory about
the invariance properties of integral 2D-operators. Instead, we will show how
a specific 2D-operator, namely the clipped-eigenvalues (CEV) operator, can be
used to compute pseudo-topological properties of images. This operator has been
introduced as a model for dot-responsive cells and described in more detail
elsewhere. Nevertheless, it seems useful to understand in the present context how
the operator can be derived from the expression for the Gaussian curvature $K$.
The determinant of the Hessian of $I$ can be written as

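$$|H_I| = \frac{1}{4}\left(\frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}\right)^2 - \frac{1}{4}\left(\frac{\partial^2 I}{\partial x^2} - \frac{\partial^2 I}{\partial y^2}\right)^2 - \left(\frac{\partial^2 I}{\partial x\,\partial y}\right)^2,$$

that is,

$$|H_I| = \frac{1}{4}\left[(\nabla^2 I)^2 - \epsilon^2\right], \tag{5}$$

where $\nabla^2 I$ is the Laplacian of the intensity $I$ and $\epsilon$ is the eccentricity; they
determine the eigenvalues of $H_I$, up to a common factor of 1/2, through the
formula $\lambda_{1,2} = \nabla^2 I \pm \epsilon$. The operation of clipping is defined as
$\lambda^+ = \max(0, \lambda)$ and $\lambda^- = \min(0, \lambda)$. The CEV operator is then
$\mathrm{CEV}(I) = \lambda_2^+(I) + \lambda_1^-(I)$. Note that in case of isotropic
patches, where $\epsilon = 0$, $\mathrm{CEV}(I) = \nabla^2 I$. But when a pattern becomes elongated,
$|\mathrm{CEV}|$ will be less than $|\nabla^2 I|$ and will become zero for straight patterns. The local
operator CEV yields a global measure $\langle\mathrm{CEV}\rangle$ defined as the average of CEV
over the whole image.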
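The behaviour of the clipped-eigenvalues operator on elementary patches can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation. It assumes the eccentricity $\epsilon = \sqrt{(I_{xx} - I_{yy})^2 + 4 I_{xy}^2}$, the eigenvalue pair $\lambda_{1,2} = \nabla^2 I \pm \epsilon$, and a sign convention chosen so that positive- and negative-elliptic patches get opposite signs while hyperbolic and straight patches give zero, as the prose describes. Derivatives are plain central differences, and the Gaussian pre-filtering mentioned later in the text is omitted.

```python
import math


def cev(image):
    """Clipped-eigenvalues response at every interior pixel (borders stay 0)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            ixx = image[y][x + 1] - 2.0 * image[y][x] + image[y][x - 1]
            iyy = image[y + 1][x] - 2.0 * image[y][x] + image[y - 1][x]
            ixy = (image[y + 1][x + 1] - image[y + 1][x - 1]
                   - image[y - 1][x + 1] + image[y - 1][x - 1]) / 4.0
            laplacian = ixx + iyy
            ecc = math.sqrt((ixx - iyy) ** 2 + 4.0 * ixy ** 2)
            lam1, lam2 = laplacian + ecc, laplacian - ecc   # lam1 >= lam2
            # Assumed convention: the smaller eigenvalue clipped from below
            # plus the larger eigenvalue clipped from above.
            out[y][x] = max(0.0, lam2) + min(0.0, lam1)
    return out


def mean_cev(image):
    """Global measure: average of the local CEV response over the image."""
    response = cev(image)
    return sum(map(sum, response)) / (len(image) * len(image[0]))


def patch(f, n=5):
    """Sample f(x, y) on an n x n grid centred on the origin."""
    half = n // 2
    return [[float(f(x - half, y - half)) for x in range(n)] for y in range(n)]


bowl = cev(patch(lambda x, y: x * x + y * y))[2][2]      # positive elliptic
dome = cev(patch(lambda x, y: -(x * x + y * y)))[2][2]   # negative elliptic
saddle = cev(patch(lambda x, y: x * x - y * y))[2][2]    # hyperbolic
ridge = cev(patch(lambda x, y: x * x))[2][2]             # straight pattern
print(bowl, dome, saddle, ridge)
```

On these quadratic patches the finite differences are exact: the isotropic bowl and dome return the Laplacian with its sign, while the saddle and the straight ridge are suppressed.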

The main difference between the Gaussian curvature and the operator CEV
is that the latter is zero for hyperbolic surface patches. Using the clipping
operation one obtains different signs for positive- and negative-elliptic patches, and
only patches with absolute CEV values above a small threshold contribute to
$\langle\mathrm{CEV}\rangle$. Fig. 2 presents some examples of the action of CEV on different
images; the corresponding values of $\langle\mathrm{CEV}\rangle$ are also shown. Note that
$|\langle\mathrm{CEV}\rangle|$ is, within a very small error due to the numerical approximation,
equal to $|\chi_T|$, the total Euler-Poincaré characteristic of the visible parts
of surfaces in the scene.

The $\langle\mathrm{CEV}\rangle$ measure exhibits the pseudo-topological invariance illustrated
above as long as the patterns have the same contrast. A contrast-independent
version of the CEV operator can be defined as

$$\mathrm{CEV}_n = \frac{\mathrm{CEV}}{\left(c + (\partial I/\partial x)^2 + (\partial I/\partial y)^2\right)^{1/2}},$$

where $c$ is a small constant. With this measure we have obtained results where
$\langle\mathrm{CEV}_n\rangle$ varied by less than 1% for patterns with different contrasts.

Pseudo-topological invariance is also limited by the scale of the operators
and the size and shape of the patterns that are involved. We should mention
here that, before computing the partial derivatives, the images are low-pass
filtered with a Gaussian filter, which defines the scale of the CEV operator (that
can be evaluated on multiple scales). However, any image can be zoomed (by
nearest-neighbor interpolation) such that patterns are approximated by
sawtooth contours (like the two tilted polygons in the third row of Fig. 2). In this
case the scale on which the CEV operator is computed can be finer than the
highest frequencies of the original patterns. On such zoomed images, we can
obtain the above pseudo-topological invariance independent of the shape of the
patterns.

Fig. 2. Responses of the CEV operator (second and fourth row) to 8 different patterns
(first and third row). In addition, the mean values $\langle\mathrm{CEV}\rangle$ (normalized to the
first image) are given.

As concerns simulations of visual functions, however, the issue of strong in- 
variance is less relevant. More important is whether we can predict the experi- 
mental results with reasonable model assumptions. 

5 Simulations of Experimental Data 

Since Minsky and Papert [11], it has been a widespread belief that topological
properties of images can neither be easily computed by simple neural networks, nor
easily perceived by human observers. In a series of experiments, Chen has shown
that subjects are sensitive to “topological” properties of images, and he has
interpreted his results as being a challenge to computational approaches to vision.
For example, in [4] he has shown that if two images are presented briefly (5 ms),
subjects discriminate better between images of a disk and a torus than they do
in case of images of a disk and a square or a disk and a triangle, respectively.
Note that the objects disk, square, and triangle are topologically equivalent,
$\chi = 2$, whereas in case of a torus $\chi = 0$; compare Eq. (1) and Fig. 1.

Further indications of this kind of performance, which Chen attributed to
some topological process, have been found in reaction-time experiments [5], where
subjects were as fast (750 ms on average) in finding the quadrant containing a
“closed” pattern (a triangle), with the 3 other quadrants containing an “open”
pattern (an arrow), as they were in finding an empty quadrant (with the other
3 quadrants containing a square); see Fig. 4.

Proposition 1, on the other hand, demonstrates that the image does not
directly exhibit the topological properties of the underlying surfaces and that
some type of further processing is needed, which we have attributed to the action
of the CEV operator. To simulate the results obtained by Chen, the image shown
at the left of Fig. 3 was used as an input, to which the CEV operator was applied
with a result shown in the middle of Fig. 3. The final results are displayed on the
right of Fig. 3. Here we have not computed the global mean values $\langle\mathrm{CEV}\rangle$
but have integrated the local CEV values by low-pass filtering.
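The spatial integration step can be sketched as a separable Gaussian low-pass filter applied to a map of local responses. This is an illustrative stand-in under stated assumptions: the kernel truncation at three standard deviations, the border handling by edge replication, and the function names are choices made here, not details given in the paper.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating the border values."""
    radius, cols = len(kernel) // 2, len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(cols):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), cols - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def low_pass(response, sigma):
    """Separable Gaussian smoothing of a 2-D response map."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(response, kernel)), kernel))


# A single isolated response spreads out but keeps its integral.
impulse = [[0.0] * 21 for _ in range(21)]
impulse[10][10] = 1.0
smoothed = low_pass(impulse, sigma=1.5)
print(sum(map(sum, smoothed)), smoothed[10][10])
```

Because the kernel is normalised, smoothing redistributes the local responses without changing their sum, so the smoothed map can be read as a position-dependent estimate of the average response at the chosen scale.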

Therefore the intensity map shown on the right of Fig. 3 can be interpreted
as a local estimate of $\langle\mathrm{CEV}\rangle$, denoted by $\mathrm{CEV}_{LP}$, that varies with $(x,y)$
and depends on the chosen scale. Obviously, if $\mathrm{CEV}_{LP}$ were the representation
which subjects use for discrimination, the difference between the ring and the
disc would be larger than the differences between the disc and the rectangle and
triangle (as found in the experiment). The existence of an end-stopped CEV-like
representation is well motivated by neurophysiological and some psychophysical
results. The spatial filtering is common in many other models, e.g. of texture
perception. What it assumes is that the similarity metric involves some spatial
integration. Of course, the $\mathrm{CEV}_{LP}$ representation depends on a few parameters,
mainly the scale of CEV itself and of the low-pass filter. However, the point
here is that it is easy to predict the experimental results with reasonable values
of the parameters, and we found the predictions to be stable with respect to
variations of the parameters. A more comprehensive analysis of how the
pseudo-topological properties depend on the spatial scales of the underlying operations
is beyond the scope of this paper. The results shown in Fig. 4 have been obtained
in a similar way. Here the $\mathrm{CEV}_{LP}$ result shown on the right illustrates the large
difference between the triangle and the arrow in this representation.

Fig. 3. Input image (left), output of the CEV operator (middle) and, on the right,
low-pass filtered CEV operator ($\mathrm{CEV}_{LP}$). The result on the right predicts the experimental
findings that humans are more sensitive to the difference between a circle and a ring,
than between a circle and a square or a triangle.




Fig. 4. Simulations as in Fig. 3 but for a different input. In this example results predict
the large perceptual difference between the open and the closed shapes arrow and
triangle.



6 Conclusions 

We have shown here that all images are topologically equivalent. From this we 
can conclude that any “topological” properties of images depend on, and are 
restricted to, some additional abstractions or computational rules. 

Chen’s experiments reveal that the human visual system is sensitive to “topo- 
logical” properties of the input patterns. Our simulations show that the results 
can be explained by assuming that the visual system evaluates integral values of 
specific curvature measures. These integral values can be seen as corresponding 
to activities of end-stopped, dot-responsive cells that are averaged over space. A 
possible interpretation is that in certain cases, e.g. at short-time presentations, 
the system evaluates integral values of an underlying, end-stopped
representation. End-stopped neurons in cortical areas V1 and V2 of monkeys are oriented
and more complex than dot-responsive cells. For simplicity, we have restricted
our simulations to the CEV operator and have argued that the basic requirement
for pseudo-topological sensitivity is that straight features are not represented.
However, we have shown before that even the retinal output could be end-stopped,
depending on the dynamics of the input, and that the quasi-topological
sensitivity is not limited to the use of the CEV operator [2].

The evaluation of integral values is assumed to be relevant to texture perception
also. Indeed, we have been able to show that certain human performances in
texture segmentation can be predicted by integrating the output of 2D-operators
in general and of the CEV operator in particular [3]. Thus, we are confident that
we are dealing with a rather general principle of visual processing.







Acknowledgment. This work is based on an earlier manuscript, which had
been supported by a grant from the Deutsche Forschungsgemeinschaft
(DFG-Re 337/7) to I. Rentschler and C. Z. We thank C. Mota and the reviewers for
valuable comments.



References 

1. E. Barth, T. Caelli, and C. Zetzsche. Image encoding, labelling and
reconstruction from differential geometry. CVGIP: Graphical Models and Image
Processing, 55(6):428-446, 1993.

2. E. Barth and C. Zetzsche. Endstopped operators based on iterated nonlinear
center-surround inhibition. In B. Rogowitz and T. Papathomas, editors, Human
Vision and Electronic Image Processing, volume 3299 of Proc. SPIE, pages 67-78,
Bellingham, WA, 1998.

3. E. Barth, C. Zetzsche, and I. Rentschler. Intrinsic two-dimensional features as
textons. J. Opt. Soc. Am. A, 15(7):1723-1732, July 1998.

4. L. Chen. Topological structure in visual perception. Science, 218(12):699-700,
1982.

5. L. Chen. Topological perception: a challenge to computational approaches to
vision. In R. Pfeifer et al., editors, Connectionism in Perspective, pages 317-329.
Elsevier Science Publishers, North-Holland, 1989.

6. M. P. Do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall,
Inc., Englewood Cliffs, NJ, 1976.

7. A. Dobbins, S. W. Zucker, and M. S. Cynader. Endstopped neurons in the visual
cortex as a substrate for calculating curvature. Nature, 329:438-441, 1987.

8. M. Ferraro. Local geometry of surfaces from shading analysis. J. Opt. Soc.
Am. A, 11:1575-1579, 1994.

9. J. J. Koenderink and W. Richards. Two-dimensional curvature operators. J. Opt.
Soc. Am. A, 5(7):1136-1141, 1988.

10. C.-N. Lee and A. Rosenfeld. Computing the Euler number of a 3D image.
Technical Report CAR-TR-205, CS-TR-1667, AFOSR-86-0092, Center for Automation
Research, University of Maryland, 1986.

11. M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.

12. B. O’Neill. Elementary Differential Geometry. Academic Press, San Diego, CA,
1966.

13. A. Rosenfeld and A. C. Kak. Digital Picture Processing. Academic Press, Orlando,
FL, 1982.

14. H. R. Wilson and W. A. Richards. Mechanisms of contour curvature discrimination.
J. Opt. Soc. Am. A, 6(1):106-115, 1989.

15. A. Yuille, M. Ferraro, and T. Zhang. Image warping for shape recovery and
recognition. Computer Vision and Image Understanding, 72:351-359, 1998.

16. C. Zetzsche and E. Barth. Fundamental limits of linear filters in the visual
processing of two-dimensional signals. Vision Research, 30:1111-1117, 1990.

17. C. Zetzsche, E. Barth, and B. Wegmann. The importance of intrinsically
two-dimensional image features in biological vision and picture coding. In A. Watson,
editor, Digital Images and Human Vision, pages 109-138. MIT Press, Cambridge,
MA, 1993.




Adaptive Segmentation of MR Axial Brain 
Images Using Connected Components* 



Alberto Biancardi¹ and Manuel Segovia-Martinez²

¹ DIS, Università di Pavia, Via Ferrata 1, Pavia, Italy,
alberto@vision.unipv.it

² CVSSP, University of Surrey, Guildford, UK,
m.segovia@ee.surrey.ac.uk



Abstract. The role of connected components and connected filters 
is well established. In this paper a new segmentation procedure is 
presented based on connected components and connected filters. 
The use of connected components simplified the development of the 
algorithm. Moreover, if connected components are available as a basic 
data type, implementation is achievable without resorting to pixel 
level processing. Using parallel platforms with hardware support for 
connected components, the algorithm can fully exploit its data parallel 
implementation. We apply our segmentation procedure to axially 
oriented magnetic resonance brain images. Novel ideas are presented 
of how connected components operations (e.g. moments and bounding 
boxes) and connected filtering (e.g. area close-opening) can be effectively 
used together. 

Keywords: Connected components, connected filters, medical image 
analysis, magnetic resonance imaging (MRI), brain segmentation. 



1 Introduction 

Magnetic Resonance (MR) imaging has become a widespread and relied-upon
source of information for medical diagnosis and evaluation of treatments,
especially as far as the brain is concerned. The first step in any computer-based
procedure has to separate the brain region of interest from the surrounding non-brain
areas in order to prevent any mismatch in the other stages of computation.
If we look at the degree of autonomy, proposed solutions range from fully
interactive tools, where the program supplies special tools that help human
operators limit their fatigue and improve their performance, to semiautomatic
tools, where a limited amount of human intervention is required (e.g. to
set initial conditions, or to validate and possibly correct the final result), to fully
automatic methods with no human action. Given the specific goal of this
segmentation task, the ideal solution calls for a method of the last class, because it
is a repetitive operation, with the quality of the first group, because it is very
difficult to do well.

* This work was in part supported by the Erasmus Intensive Programme IP2000

In the fully automatic class existing methods rely on multiple passes to
achieve their result; they may be roughly grouped into two classes and
summarised as follows:

1. extract close-enough regions (mainly by filtering and thresholding each
image) and then refine their contours;

2. classify pixels or voxels and then interpolate or extract the boundaries.

In both cases the actual segmentation stage requires a refinement because the
boundaries are not reliable enough to be used directly.

Thanks to the use of boundary-preserving filters, our method avoids the need 
for further processing. Another novelty of our approach is the extensive use of 
operators based on connected components that allows it to be implemented in a 
data parallel way. 

After an overview of the mathematical and processing tools used by the 
method, each step of the procedure will be presented. A brief analysis and con- 
clusions will complete the paper. 



2 Connected Components and Connected Filters 

The development of image analysis applications is a deeply involved operation
where multiple stages take place before reaching satisfactory results. While
moving through these stages the process is faced with variations and options. Being
able to describe and implement all the algorithms involved in the application
without losing focus in low-level details is fundamental.

The data parallel paradigm has proved highly valuable as far as the early 
vision stage is concerned. However, when dealing with intermediate-level vision, 
the processing of pixel aggregates or the performing of irregular data movements 
has limited its appeal. It is also true, on the other hand, that most of these image- 
analysis transformations may be expressed in terms of operations on connected 
sets of pixels, which can represent a region, a contour, or other connected parts 
with respect to the chosen image topology. Connected components may, thus, 
represent a kind of structural information that is easily computed even if, by def- 
inition, it is image-data dependent. By using the shape of connected components 
as a modifier of data-parallel operations, an extended data-parallel paradigm can 
be defined [3] that preserves the benefits of the plain one and is able to tackle
the needs of image analysis tasks. 

An additional benefit is that algorithms can be implemented using only data
parallel commands, making them suitable for porting to a massively parallel
architecture with hardware support for connected components.



2.1 Basic Definitions 

As connected components play such a key role, some basic definitions are given
to support the description of the proposed algorithm.

Given an image (of $\mathcal{N} \times \mathcal{M}$ pixels), the concept of neighbouring pixels can
be formally described using the image planar graph representation, where
vertices represent pixels and edges represent the membership of pixels in each
other's neighbourhood.

Binary image: A binary image $B$ is a subset of $\mathbb{N} \times \mathbb{N}$.

Relationship: If two pixels $U, V \in B$ are related by the binary relation
$C \subset B \times B$, this is denoted $U.C.V$.

Connectivity: For each image pixel a (symmetrical) connectivity relationship $C$
is defined so that $U.C.V$ iff pixels $U$ and $V$ are connected. If $V \in B$, the
subset $N = \{U \in B : U.C.V\}$ is the (nearest) neighbourhood of $V$.

Connected components: The partitions of $B$ by $C$ are called connected
components. Equivalently, two pixels $U, V \in B$ belong to the same connected
component iff there exists a set $P = \{P_0, P_1, \ldots, P_n\}$ such that
$P_0 = U$, $P_n = V$ and $\forall i \in \mathbb{N},\ 0 \le i < n : P_i.C.P_{i+1}$.
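The path-based definition translates directly into a flood fill. The sketch below is a plain sequential illustration of the definitions, not the data-parallel implementation the paper relies on; the neighbourhood tables and the function name are assumptions made for the example.

```python
from collections import deque

# Connectivity relation C expressed as neighbour offsets.
N4 = ((-1, 0), (1, 0), (0, -1), (0, 1))
N8 = N4 + ((-1, -1), (-1, 1), (1, -1), (1, 1))


def label_components(binary, neighbours=N8):
    """Label the connected components of a binary image (labels start at 1)."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:                      # follow every path P_0 ... P_n
                pr, pc = queue.popleft()
                for dr, dc in neighbours:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not labels[nr][nc]):
                        labels[nr][nc] = count
                        queue.append((nr, nc))
    return labels, count


image = [[1, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 1, 0, 1],
         [1, 0, 0, 0]]

labels8, n8 = label_components(image, N8)
labels4, n4 = label_components(image, N4)
print(n8, n4)
```

The two pixels that touch only diagonally belong to one component under the 8-neighbour relation and to two under the 4-neighbour relation, which shows how the partition depends on the chosen connectivity.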



2.2 Filters by Reconstruction 

Filters by reconstruction belong to the family of connected operators and
comprise openings by reconstruction and closings by reconstruction.

In particular, we define as opening by reconstruction any operation that is the
composition of a pixel-removing operation with a trivial connected
opening, which actually reconstructs any connected component that has not been
completely removed; closing by reconstruction, on the other hand, is the dual
operation, in that it is the composition of a pixel-adding operation with a
trivial connected closing, which completely removes any component that is not
entirely preserved. Connected openings and connected closings are also known
under the names of geodesic dilations and geodesic erosions, or propagations,
depending on the point of view from which they were first introduced.

Filters by reconstruction for grey-scale images are computed by stacking (i.e.
adding) the results of their binary counterparts applied to each of the (grey-scale)
image cross sections.

Area filters, which are used by this method, belong to the class of filters by
reconstruction; in particular, area openings and area closings use a size criterion
for the pixel-removing or pixel-adding operations: any component whose size is
less than the required amount is removed.
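A small sketch may make the area opening and the stacking of cross sections concrete. It is a direct, unoptimised illustration under stated assumptions: 4-connectivity, non-negative integer grey levels, and function names chosen here for the example.

```python
def components(binary):
    """Yield the 4-connected components of a binary image as pixel lists."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            component, stack = [], [(r, c)]
            while stack:
                pr, pc = stack.pop()
                component.append((pr, pc))
                for nr, nc in ((pr - 1, pc), (pr + 1, pc), (pr, pc - 1), (pr, pc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            yield component


def binary_area_opening(binary, min_area):
    """Keep only the components with at least min_area pixels."""
    out = [[0] * len(binary[0]) for _ in binary]
    for component in components(binary):
        if len(component) >= min_area:
            for r, c in component:
                out[r][c] = 1
    return out


def grey_area_opening(image, min_area):
    """Stack the binary area openings of all cross sections (thresholds >= 1)."""
    out = [[0] * len(image[0]) for _ in image]
    for level in range(1, max(map(max, image)) + 1):
        section = [[1 if v >= level else 0 for v in row] for row in image]
        opened = binary_area_opening(section, min_area)
        for r, row in enumerate(opened):
            for c, v in enumerate(row):
                out[r][c] += v
    return out


peak_on_plateau = [[1, 1, 1],
                   [1, 4, 1],
                   [1, 1, 1]]
print(grey_area_opening(peak_on_plateau, min_area=2))
```

The one-pixel peak is too small to survive at the higher cross sections and is flattened to the level of the plateau, while the plateau itself and its outline are left untouched.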

3 An Overview of the Method 

MR brain images are characterised by being heavily textured. But, while texture 
may be essential for the discrimination of sickness types, it makes the segmen- 
tation more difficult. Segmentation tries to extract one or more shapes that 
operators readily perceive notwithstanding the many variations of pixel intensi- 
ties. 

Our method is modelled after the following conjecture: operators are able
to elicit some homogeneity criteria for image regions and then they choose only
those regions that match a higher level goal. Filters by reconstruction were seen
as the key operation for driving the two stages, as they have the property of 
simplifying the images while preserving contours, while thresholding is used as 
the core of the region selection step. In more detail, the sequence of operations 
is as follows: 

— an area filtering regularises the brain image; 

— a thresholding level is determined by processing an image that evaluates the 
correlation between distance from the image centre and intensity of all the 
image pixels; 

— a thresholding operation that uses the previously computed level highlights 
completely the brain area and, possibly, some other extraneous pieces (e.g. 
ocular bulbs) lying outwardly with respect to the image centre; 

— a final filtering by reconstruction removes all the unwanted regions. 

Since thresholds can be found by means of an automatic procedure in the upper 
frames only, an adaptive procedure extrapolates the missing values in following 
frames going downward. 

3.1 Area Filtering 

Since the integrity of the brain region, and therefore of its boundary, must be
preserved, keeping the thin areas that may be present along the border
was seen as crucial for a successful filtering. Hence, area filters were chosen
among the filters by reconstruction because of their shape-preserving ability;
at the same time these filters reduce variation among pixel values, which again
helps in limiting the bad effects textures may cause.

One problem with their application to gray-scale images is that they may 
cause an intensity drift in the resulting image. This is why after applying the fil- 
ter, our method replaces the value of each connected component with its average 
gray level, re-computed on the original (un-filtered) frame. 
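The drift correction can be sketched as follows. The code is an illustration under an assumption made here: the connected components in question are taken to be the flat zones of the filtered frame, i.e. 4-connected sets of pixels sharing one grey level. Each zone is then given the mean of the original values it covers.

```python
def restore_zone_means(filtered, original):
    """Replace every flat zone of `filtered` by the mean of `original` over it."""
    rows, cols = len(filtered), len(filtered[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if out[r][c] is not None:
                continue
            level = filtered[r][c]
            zone, stack, seen = [], [(r, c)], {(r, c)}
            while stack:                       # collect the flat zone
                pr, pc = stack.pop()
                zone.append((pr, pc))
                for nr, nc in ((pr - 1, pc), (pr + 1, pc), (pr, pc - 1), (pr, pc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and filtered[nr][nc] == level):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            mean = sum(original[pr][pc] for pr, pc in zone) / len(zone)
            for pr, pc in zone:
                out[pr][pc] = mean
    return out


filtered = [[1, 1, 3],
            [1, 1, 3]]
original = [[0, 2, 4],
            [2, 4, 6]]
print(restore_zone_means(filtered, original))
```

The zone shapes come from the filtered frame, so contours stay where the area filter left them, while the grey values are re-anchored to the unfiltered data.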

3.2 Threshold Level 

In the upper frames, the brain area is made of a single, elliptical area. This
area has no dark parts, so our algorithm uses this a priori knowledge to select an
optimal threshold.

Figure 1 shows the correlation between a pixel's distance from the image
centre and its intensity, where the distance is placed along the abscissas and the
intensity along the ordinates (the origin, i.e. the upper left corner,
shows the number of pixels having zero distance and zero intensity). Since there
are no dark points at distance 0 (close to the image centre), the threshold can
be found by looking for the topmost pixel in the low-distance range; this means
finding the maximum ordinate of the connected component to which the pixel at
co-ordinates (0, 0) belongs.
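The idea behind the threshold search can be sketched with a simplified stand-in. The code below builds the distance/intensity correlation image as a 2-D histogram and then, instead of the connected-component operation described above, simply scans the low-distance columns for the darkest occupied intensity. The distance rounding, the `low_distance` parameter and the function names are assumptions made for the illustration.

```python
import math


def centre_distance(r, c, rows, cols):
    """Rounded Euclidean distance of pixel (r, c) from the frame centre."""
    return int(round(math.hypot(r - (rows - 1) / 2.0, c - (cols - 1) / 2.0)))


def correlation_image(frame, n_levels=256):
    """hist[intensity][distance] = number of pixels with that pair of values."""
    rows, cols = len(frame), len(frame[0])
    max_distance = centre_distance(0, 0, rows, cols)
    hist = [[0] * (max_distance + 1) for _ in range(n_levels)]
    for r in range(rows):
        for c in range(cols):
            hist[frame[r][c]][centre_distance(r, c, rows, cols)] += 1
    return hist


def threshold_level(hist, low_distance):
    """Highest level t such that no pixel near the centre has intensity <= t."""
    for intensity, row in enumerate(hist):
        if any(row[:low_distance + 1]):
            return intensity - 1
    return len(hist) - 1


# Synthetic frame: a bright disk (200) on a dark background (10).
size = 9
frame = [[200 if centre_distance(r, c, size, size) <= 3 else 10
          for c in range(size)] for r in range(size)]

hist = correlation_image(frame)
level = threshold_level(hist, low_distance=2)
segmented = [[1 if v > level else 0 for v in row] for row in frame]
print(level, sum(map(sum, segmented)))
```

On real frames the occupied part of the histogram is sparse near distance zero, which is presumably why the paper regularises the correlation image with a closing and a dilation before looking at connected components; the sketch ignores that step.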

In order to detect when the learning phase must stop, the minimum distance
of any point whose intensity is within a pre-defined range for dark pixels is
determined by finding the maximum abscissa of the same component containing
the pixel at (0,0). The range is expressed by means of a closing of the correlation
image by a vertical line whose length is equal to the maximum value of the range
for dark pixels; a dilation by a horizontal line, on the other hand, is required to
limit the connected component at the topmost pixel within the range of allowed
distances for bright pixels.

Fig. 1. The three stages for finding the threshold level.

When the learning stage is completed, the average of the threshold levels 
found up to that point is used as threshold level for all the following frames. 

3.3 Filtering Out Unrelated Components 

The thresholding operation alone is not able to segment the brain region. By
looking at the correlation image (Figure 1a) it is clearly visible that the brain
region, being the closest one to the image centre, contributes the pixels that are
on the left side; selecting any threshold level that keeps those points will preserve
some of the surrounding components as well. A new filtering stage is therefore
mandatory.

As long as the threshold selection is in the learning stage, it is possible to
select the brain region simply by reconstructing the only connected component
that covers the image centre (which may not be the visual centre of the brain
region, but is close enough to allow the program to recover the correct
region). Figure 2 shows the boundary of the extracted region superimposed on
the size-filtered image and on the original MR image.

Fig. 2. Boundary of the final segmented region.



When the threshold selection switches to the extrapolating stage, some
extra care must be taken because of the possible presence of brain regions not
connected to the main central area. The brain area generated by each frame,
therefore, is used as a validation mask for the following frame (always going
downwards): if the overlap between a current-frame region and the
validation mask is below a certain percentage threshold, that region is removed
from the final brain area. Figure 3a shows in dark grey the outline of the brain
area at the previous frame together with the current frame result in white;
Figure 3b shows only the outline of the selected area for the current frame
superimposed on the original image.
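
The validation step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' data-parallel implementation: the regions of the current frame are assumed to be already labelled (0 for background), the overlap is assumed to be measured as the fraction of a region's own pixels that fall inside the previous mask, and the function name and the default value of `min_overlap` are hypothetical, since the text gives no value for the percentage threshold.

```python
import numpy as np


def filter_regions_by_overlap(labels, prev_mask, min_overlap=0.5):
    """Keep the labelled regions that sufficiently overlap the previous brain mask.

    labels      -- integer array, 0 = background, 1..n = regions of the current frame
    prev_mask   -- boolean array, brain area accepted for the previous frame
    min_overlap -- hypothetical threshold (fraction of the region's own area)
    """
    labels = np.asarray(labels)
    prev_mask = np.asarray(prev_mask, dtype=bool)
    kept = np.zeros(labels.shape, dtype=bool)
    for region_id in np.unique(labels):
        if region_id == 0:
            continue  # skip the background
        region = labels == region_id
        overlap = np.count_nonzero(region & prev_mask) / np.count_nonzero(region)
        if overlap >= min_overlap:
            kept |= region  # the region is validated by the previous frame
    return kept
```

The returned mask would then serve as the validation mask for the next frame.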



4 Results and Discussion 

The method was developed using a processing environment that supplies an 
augmented data-parallel environment with direct support for connected compo- 
nents PI, and it was tested on several PD brain scans obtained from the Harvard 
Brain Atlas (http://www.med.harvard.edu/AANLIB/home.html). 

The main problem arises when the brain region is kept connected to the 
bright surrounding border that survives thresholding. This case is easy to detect 
and may be solved by an apt number of successive erosions (that will be mirrored 
by an equal number of dilations after filtering by reconstruction). Even if this 
procedure reaches its goal, it has the unfortunate effect of removing some of
the fine details of the region. We are currently studying a data-parallel
approach to finding the optimal cut for these extraneous paths.

The results achieved by our method seem on a par with those reported in the
literature. Many methods take advantage of the “Snakes” algorithm to refine
the brain region border; by so doing they lose the ability to track thin features
that go toward region centres. Our method tries to avoid any border correction,
thus preserving thin details.

Fig. 3. Example of segmentation with multiple regions.

5 Conclusion 

An automatic method for brain extraction from axial MR images was presented. 
An adaptive method for threshold selection and widespread use of connected 
filtering are the main novelties of this method. Results are on a par with other
methods; future work will attempt to improve quality and robustness.

Abstract. In this paper, we give an overview of the existing algorithms
for discrete curvature estimation. We extend the Worring and Smeulders
[WS93] classification to new algorithms and we present a new and purely
discrete algorithm based on discrete osculating circle estimation.

Keywords. Discrete Geometry, Curvature Calculus, Discrete Circles. 



Introduction 



Boundary analysis is an important step in many computer vision paradigms 
in which contours are used to extract conceptual information from objects. In 
such a contour analysis, the curvature is a commonly used geometrical invariant 
to extract characteristic points. In 3D medical imaging or in 3D snow sample 
analysis, the curvature measurement on surfaces is also used as a registration 
tool [MLD94, TG92] or as a physical or mechanical characteristics extraction tool
fPi^ . 



In classical mathematics, the curvature calculus is clearly defined and its 
properties are well known but when we want to apply this calculus on discrete 
data (2D or 3D discrete images), two different approaches are possible: we can 
first change the model of the data and put them into the classical continuous 
space by using interpolations or parameterizations of mathematical objects (B- 
splines, quadratic surfaces) on which the continuous curvature can be easily 
computed [Cha.flDIITHTShSj . Otherwise, we can try to express discrete curvature 
definitions and properties, and make sure that these new definitions are coherent 
with the continuous ones. 

In the first approach, we have two main problems: the first one is that there 
exists a great number of parameterization algorithms in which some parameters 
have to be set according to the inputs. In order to provide a given accuracy, we 
have to reduce the input area and thus to limit our method. The second problem 
is that these algorithms have a prohibitive computational time when we use
large input data such as in medical imaging.

In a discrete approach, there are three classical ways to define this
differential operator: the tangent orientation based algorithms, the second
discrete derivative estimation based methods and finally the osculating circle
based methods. In the continuous space, these definitions are equivalent but in
the digital space, they lead to specific classes of algorithms.

In this paper, we propose an optimal algorithm to compute the curvature of
a discrete curve in 2 or 3 dimensions based on circle estimation. This algorithm
has two important properties: it is a purely low-level algorithm that needs
neither a preprocessing tool nor data-linked parameters, and it relies only on
the discrete model.

1 Framework and Related Methods 

In this section, we present the different definitions of the continuous curvature
that can be found in the literature. We explain what they become in the discrete
space and we extend Worring and Smeulders’ classification to new discrete
algorithms.

1.1 Continuous Definitions 

Given a continuous object $\mathcal{X}$ with boundary $\partial\mathcal{X}$, we
consider a curvilinear abscissa parametrization $x(s)$ of the boundary. We have
three classical ways to define the curvature of a curve or a path $x$. The first
one is based on the norm of the second derivative of the curve.

Definition 1 (Second derivative based curvature) 

$$k(s) = \mathrm{sign} \cdot \|x''(s)\|$$

where sign is either -1 or 1 according to the local convexity of the curve. 

We can also define the curvature using the directional changes of the tangent. 
The curvature is obtained by computing the angle variations between the tangent 
t of the curve and a given axis. 

Definition 2 (Tangent Orientation based curvature) 

$$k(s) = \theta'(s) \quad \text{where} \quad \theta(s) = \angle(t(s), \text{axis})$$

Finally, we have a geometrical approach of the curvature definition given by 
the osculating circle of radius r(s). 

Definition 3 (Osculating circle based curvature) 

$$k(s) = \frac{\mathrm{sign}}{r(s)}$$
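
As a quick sanity check (not part of the original text), the three definitions can be compared on a circle of radius $r$ parametrized by arc length; the parametrization and the choice of the x-axis as reference axis are assumptions made only for this illustration, and the sign is taken as +1.

```latex
% Circle of radius r, parametrized by arc length s
x(s) = \bigl(r\cos(s/r),\; r\sin(s/r)\bigr)
% Definition 1: norm of the second derivative
x''(s) = -\tfrac{1}{r}\bigl(\cos(s/r),\; \sin(s/r)\bigr), \qquad \|x''(s)\| = \tfrac{1}{r}
% Definition 2: orientation of the tangent with respect to the x-axis
t(s) = \bigl(-\sin(s/r),\; \cos(s/r)\bigr), \qquad \theta(s) = s/r + \pi/2, \qquad \theta'(s) = \tfrac{1}{r}
% Definition 3: the osculating circle is the circle itself
r(s) = r, \qquad 1/r(s) = \tfrac{1}{r}
```

All three give the curvature $1/r$.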

In a continuous space, all these definitions are obviously equivalent whereas
in discrete space, they lead to specific algorithms.

1.2 Discrete Curvature Computation

Now, we consider $X$ as a digitization of the object $\mathcal{X}$ on a discrete
grid of resolution $h$ using a digitization process $D_h$ (i.e.
$X = D_h(\mathcal{X})$). The goal of this paper is not to study the best
digitization process. Based on the discrete object $X$, we can define the
boundary $\partial X$ of $X$ as the set of eight-connected points such that each
of them has a four-connected neighbor in the complement of $X$
[IKB.96] . 

We can now formulate the discrete curvature computation: given a discrete
boundary $\partial X$, how can we compute the curvature at each point?

First of all, we detail existing algorithms that can be classified according
to the previous definitions. Note that, due to the difficulty of providing
an accurate first derivative calculus of a discrete curve, the second derivative
based methods are not really tractable. Nevertheless, Worring et al. have
presented a definition of the curvature using the second derivative at a point
$p_i$, computed with derivative Gaussian kernels [WS93].



Tangent-Orientation based methods. A first definition of the tangent ori- 
entation based discrete curvature is computed using a local angle im)7,^IGM91l : 
given a neighborhood $m$, the discrete curvature at a point $p_i$ of $\partial X$ is:

$$k_h(p_i) = \angle(\overrightarrow{p_{i-m} p_i}, \overrightarrow{p_i p_{i+m}})$$

Rosenfeld et al. proposed a new curvature calculus that adapts the window size
$m$ to the local characteristics of the discrete curve.
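
A minimal sketch of the local-angle estimator is given below. It assumes a closed curve stored as a list of integer points (indices are taken modulo the curve length) and returns the signed turning angle between the two chords; whether the cited methods normalize this angle by a length is not stated in the text, so no normalization is applied. The function name is hypothetical.

```python
import math


def local_angle_curvature(points, i, m):
    """Signed angle at points[i] between the chords p[i-m]->p[i] and p[i]->p[i+m].

    points -- list of (x, y) tuples describing a closed discrete curve (assumption)
    i      -- index of the point of interest
    m      -- size of the neighborhood (window)
    """
    n = len(points)
    x0, y0 = points[(i - m) % n]
    x1, y1 = points[i % n]
    x2, y2 = points[(i + m) % n]
    ux, uy = x1 - x0, y1 - y0  # incoming chord
    vx, vy = x2 - x1, y2 - y1  # outgoing chord
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.atan2(cross, dot)  # value in (-pi, pi]
```

On the boundary of a digital square, the estimate is zero along the sides and a quarter turn at the corners when `m = 1`, which illustrates why the window size matters.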

In the same way, Worring and Smeulders [WS93] proposed curvature definitions
based on the variation of the angle between the best straight line fitting the
data and an axis in a given neighborhood of $p_i$:

$$k_h(p_i) = \frac{\theta(p_i) * G'_\sigma}{1.107} \quad \text{where} \quad \theta(p_i) = \angle(\mathrm{LineFitting}(p_i, m), x\text{-axis})$$

Worring et al. computed the line fitting using an optimization in a window
of size $m$. $G'_\sigma$ corresponds to the derivative of a Gaussian kernel with
parameter $\sigma$, used to estimate the variation of the angle.

Vialard proposed a purely discrete definition of the line fitting process.
Let us recall the definition of a discrete straight line given by Reveilles
[Rev91]. A set of discrete points belongs to the arithmetical discrete straight
line $D(a, b, \mu, \omega)$ (with $a, b, \mu, \omega \in \mathbb{Z}$) if and only
if each point $p(x, y)$ of the set satisfies:

$$\mu \leq ax - by < \mu + \omega$$

$a/b$ denotes the slope of the line, $\mu$ the lower bound on the grid and
$\omega$ the thickness (in this paper, we only consider naive straight lines
with $\omega = \sup(|a|, |b|)$).

Based on this definition, Vialard defines the discrete tangent at a point $p$ of
a curve as the longest discrete straight line centered in $p$ that belongs to the
curve. This calculus is driven by Debled’s straight line recognition algorithm.
Vialard defines the curvature at a point $p_i$ of a discrete curve with the
equation:

$$k_h(p_i) = \theta(p_i) * G'_\sigma \quad \text{where} \quad \theta(p_i) = \angle(T(p_i), x\text{-axis})$$

where $T(p_i)$ denotes the discrete tangent centered at $p_i$.

Feschet and Tougne proposed an optimal algorithm to compute
the tangent at each point of a discrete curve.
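
The arithmetical definition of a discrete straight line recalled above translates directly into an integer membership test. The sketch below is only an illustration (the function name is hypothetical); it defaults to the naive thickness, the largest of |a| and |b|, which is the only case considered in the paper.

```python
def on_discrete_line(x, y, a, b, mu, omega=None):
    """Return True if (x, y) satisfies mu <= a*x - b*y < mu + omega.

    When omega is omitted, the naive thickness max(|a|, |b|) is used.
    """
    if omega is None:
        omega = max(abs(a), abs(b))
    return mu <= a * x - b * y < mu + omega


# Points of the naive line of slope 1/2 (a = 1, b = 2, mu = 0) for 0 <= x <= 5
line_points = [(x, y) for x in range(6) for y in range(6)
               if on_discrete_line(x, y, a=1, b=2, mu=0)]
```

Because the test uses only integer arithmetic, it is exact; the recognition algorithms cited in the text go further and find the parameters incrementally from the points, which this sketch does not attempt.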



Osculating Circle based Methods. In this approach, the estimation of the
best fitting circle often leads to statistical or optimization processes. Worring
[WS93] proposed to optimize a distance between the smoothed data in a given
window and Euclidean circles. Kovalevsky proposed a geometrical
approach to compute the set of possible arcs (maybe empty) separating two
discrete sets of points using a Voronoi-like algorithm. We can use this algorithm
to solve our arc fitting problem by computing the separating circle between
the two connected components of the complement of the discrete boundary.
Obviously, we can reduce the size of the two sets by considering only the points
connected to the boundary (see figure 1). The problem of this algorithm is that the
computational cost of the calculus of the Voronoi cell in O(n^) (n denotes the 
size of the curve) makes this approach intractable compared with
Vialard’s algorithm, whose computational cost is O(n).




Fig. 1. Preprocessing for Kovalevsky’s algorithm: (a) the discrete object $X$ with
its discrete boundary $\partial X$, (b) the two connected components of the complement of the
boundary, (c) the resulting two sets which are the input of Kovalevsky’s algorithm.




Discussion. In the tangent orientation based approach, Vialard’s algorithm
shows how the discrete model and discrete operators solve the statistical
problem of the line fitting process, and hence the curvature calculus, in an
optimal time. However, such an algorithm needs a derivative filter and this
leads to two problems: we have to estimate a smoothing parameter $\sigma$
according to the data, and we use a non-discrete operator. If we want to define
the curvature on a discrete curve in a purely discrete way, the only possible
approach is the geometrical one, provided that we are able to recognize or
estimate osculating discrete circles.

2 A Purely Discrete Optimal Time Algorithm 

2.1 Discrete Circle Analysis 

First, we have to define and to characterize discrete circles. Andres [And94]
proposed a discrete arithmetical circle definition based on a double Diophantine
inequality: $p(x, y)$ belongs to the discrete arithmetical circle
$C(x_0, y_0, r)$ (with $x_0, y_0, r \in \mathbb{Z}$) if and only if it satisfies:

$$(r - 1/2)^2 \leq (x - x_0)^2 + (y - y_0)^2 < (r + 1/2)^2 \tag{1}$$
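
The double inequality defining an arithmetical circle, that is, the squared distance to the center lying between the squares of r - 1/2 (inclusive) and r + 1/2 (exclusive), can be tested exactly with integers by multiplying every term by 4. The sketch below is an illustration with hypothetical function names, not code from the paper.

```python
def on_arithmetical_circle(x, y, x0, y0, r):
    """Exact integer test of (r - 1/2)^2 <= (x-x0)^2 + (y-y0)^2 < (r + 1/2)^2."""
    d2_times_4 = 4 * ((x - x0) ** 2 + (y - y0) ** 2)
    return (2 * r - 1) ** 2 <= d2_times_4 < (2 * r + 1) ** 2


def arithmetical_circle_points(x0, y0, r):
    """Enumerate the grid points of the arithmetical circle by brute force."""
    return [(x, y)
            for x in range(x0 - r - 1, x0 + r + 2)
            for y in range(y0 - r - 1, y0 + r + 2)
            if on_arithmetical_circle(x, y, x0, y0, r)]
```

For r = 2 centered at the origin, the enumeration keeps the grid points whose squared distance to the center is 4 or 5. The brute-force scan is quadratic in r and is meant only to make the definition concrete.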

In the discrete straight line recognition algorithms, two main approaches
can be distinguished: the first one is a geometrical and arithmetical approach,
roughly based on computing the narrowest strip that encloses
the points. The second one is an algebraic approach that solves the
system of inequalities over the integers given by the double Diophantine
inequality at each point. Due to the non-linearity of inequality (1), none of
these two approaches can be easily extended to the discrete circle recognition
problem with a computational time similar to that of Vialard’s algorithm.
Therefore the idea is to develop an approximation algorithm.

In [DMT99], Tougne et al. present a new approach to define discrete circles:
whatever the digitization process is, a discrete circle is the result of the
intersections between the continuous circle $x^2 + y^2 = r^2$ and the discrete
grid. In the first octant, if we analyze the intersections between a continuous
circle and the vertical discrete lines $x = r - k$ ($0 < k < r$), we obtain a
bundle of parabolas $y = \sqrt{2kx + k^2}$. Hence, each kind of discrete circle
(arithmetical circles, Pitteway’s circles or Bresenham’s circles) is the result
of a specific digitization of these parabolas (see figure 2 for an example of
the discrete bundle of parabolas associated to the Pitteway’s circles).

Using this bundle of parabolas, a discrete circle can be built in the first octant
with vertical patches only. For example, if we consider the case of Pitteway’s
circles, the height of the first vertical patch is given by the formula¹:

h = [\/2aT+T] 

Note that any kind of discrete circle can be entirely characterized by its
bundle of discrete parabolas.

In the following, since Tougne showed the equivalence of discrete circles given
by their parabolas, we choose to estimate arithmetical osculating circles and
base our calculus on the discrete bundle of parabolas that generates the
arithmetical circles. This bundle of parabolas can be computed by considering
the intersection between the Euclidean circles $x^2 + y^2 = (r - 1/2)^2$ and the
grid; this leads to the bundle:

$$y = \sqrt{r(1 + 2k) - k^2 + 1/4} \quad \text{with } k \text{ a positive integer}$$

¹ [·] denotes the rounding to the closest integer.

Fig. 2. The Pitteway’s circle of radius 20 and its discrete parabolas given by the
equation $y = [\sqrt{2kx + k^2}]$.



2.2 The Algorithm 

We only use arithmetical circles due to their analogy to Euclidean rings: a
discrete point belongs to an arithmetical circle if and only if it belongs to the
Euclidean ring of radii $r + 1/2$ and $r - 1/2$. Furthermore, the discrete tangent
can be viewed as a digitization of the longest tangent of the circle of radius $r$
that lies in the ring (see figure 3).

In the continuous space, it is clear that the length of the inner tangent
uniquely characterizes the radius of the ring. In the discrete space, since we
are not able to recognize discrete circles, we have to make a link between
discrete circles and well known discrete objects such as discrete straight lines.
At this point of the discussion, we make the assumption that the length of the
discrete tangent given by Vialard’s algorithm is constant at each point of an
arithmetical circle. In practice, there is a small variation of the tangent
length on a discrete circle, but we will see that this variation does not
interfere with the results of the curvature calculus. Based on this assumption,
we can propose a discrete osculating circle estimation: given a discrete curve,
let $l$ be the half length of the discrete tangent at a given point of the curve.
Since the length of the discrete tangent on an arithmetical circle is constant,
we can link the half discrete tangent to the first vertical patch given by the
discrete parabolas associated to the arithmetical circles. Hence, by inverting
the discrete equation of the first discrete parabola linked to the arithmetical
circle, we can compute the set $S_l$ of radii of discrete circles which have a
first vertical patch of half length $l$:

Fig. 3. Continuous ring and the inner tangent.

$$S_l = \{r_{inf}, \ldots, r_{sup}\}$$

$$r_{inf} = \lceil (l - 1/2)^2 - 1/4 \rceil$$

$$r_{sup} = \lfloor (l + 1/2)^2 - 1/4 \rfloor$$

Now, given this set of possible discrete circles, we can compute the possible
curvatures by inverting the radii. In practice, we return as an estimation of
the curvature the inverse of the mean radius of this set:

$$k_h(S_l) = \frac{2}{r_{inf} + r_{sup}}$$

Since the discrete tangent can be computed in an optimal time, we present,
in Algorithm 1, an optimal-time algorithm for the curvature computation of
a discrete curve.



Algorithm 1 Curvature Estimation
Compute-curvature(Curve c)

for all pixel p in c do
    Compute the tangent τ in p
    Estimate the radii r_inf and r_sup using l = length(τ)/2
    return 2 / (r_inf + r_sup)
end for
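
The radius bounds and the final estimate of Algorithm 1 can be sketched as below. This covers only the arithmetic part, using the bounds r_inf = ceil((l - 1/2)^2 - 1/4) and r_sup = floor((l + 1/2)^2 - 1/4) and the estimate 2 / (r_inf + r_sup); the discrete tangent computation itself is assumed to be available elsewhere, so the tangent lengths are simply passed in as integers. Function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues in the floor and ceiling.

```python
import math
from fractions import Fraction


def radius_bounds(half_length):
    """Bounds (r_inf, r_sup) of the radii compatible with a half tangent length l."""
    l = Fraction(half_length)
    r_inf = math.ceil((l - Fraction(1, 2)) ** 2 - Fraction(1, 4))
    r_sup = math.floor((l + Fraction(1, 2)) ** 2 - Fraction(1, 4))
    return r_inf, r_sup


def curvature_from_half_tangent(half_length):
    """Inverse of the mean radius of the set {r_inf, ..., r_sup}."""
    r_inf, r_sup = radius_bounds(half_length)
    if r_inf + r_sup <= 0:
        raise ValueError("half tangent length too small to estimate a radius")
    return 2 / (r_inf + r_sup)


def curvature_profile(tangent_lengths):
    """Curvature estimate at each point, given integer discrete tangent lengths."""
    return [curvature_from_half_tangent(Fraction(length, 2))
            for length in tangent_lengths]
```

For an integer half length l the bounds simplify to l² - l and l² + l, so the estimate reduces to 1/l²; a longer tangent therefore gives a flatter estimated curve, as expected.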



2.3 Discussion 



In this section, we discuss the link between the grid of resolution $h$
and our curvature calculus. Let us denote by $k$ the curvature at the point
$x(s)$ of $\partial\mathcal{X}$. We consider a digitization of
$\partial\mathcal{X}$ on a grid of step $h$ denoted by $\partial X_h$.
Our algorithm estimates the curvature $k_h$ at the discrete point associated to
$s$. A classical way in discrete geometry to justify a new discrete definition of
a Euclidean tool is to prove the asymptotic convergence of the measure. Thus, if
$k_h$ is the estimation on a grid of size $h$ and $k$ the expected Euclidean
curvature, we must prove the following equality:

$$\lim_{h \to 0} k_h = k$$

Fig. 4. Accuracy of the estimator: (a) error in the curvature estimation
($|1/r_h - k_h|$) when the digitization grid is refined, (b) comparison between
Vialard’s algorithm with $\sigma = 1$ and our estimator on a discrete circle of
radius 40.



This equality means that the error of the estimation converges to 0 when we 
refine the discrete grid. 

Since there is no formal proof of the convergence of the discrete tangent, we
can only show an experimental asymptotic convergence considering Euclidean
disks and their discretization. Let $R$ denote the radius of a Euclidean disk
$\mathcal{X}$. The radius of the discrete approximating circle of
$\partial\mathcal{X}$ is given by $r_h = [R/h]$. Decreasing $h$ to 0, the
estimated curvature $k_h$ should converge to $1/r_h$. Instead of decreasing $h$,
we can compute the mean curvature of discrete circles of increasing radius and
check whether the estimation error $|k_h - 1/r_h|$ converges to 0. This
experimental result can be found in figure 4a.



2.4 Results 

First of all, we have to check the accuracy of our estimation for a given grid
step $h$. Hence, we compare our algorithm to Vialard’s on a discrete circle (see
figure 4b). The results are quite similar, but note that in our case there is no
smoothing process and our computation is purely discrete. We have also tested
the accuracy in the sense of the localization of high curvature points (see
figure 5).

Figueiredo and Reveilles have proposed an arithmetical definition
of 3D discrete lines based on the two-dimensional ones: a 3D set of points
is a 3D discrete straight line if and only if two of its three canonical
projections on the grid planes are 2D discrete straight lines. Based on this
definition, Debled has proposed an optimal-time recognition algorithm for 3D
lines. In the same way, we can define a 3D discrete tangent using the recognition
algorithm, and thus we can use our algorithm to compute in an optimal time the
curvature of a 3D discrete curve (see figure 6).

Fig. 5. Curvature of an unfolded square.

Fig. 6. An example of a 3D discrete curve (dark gray) composed of a circular arc and
a straight line, and its unsigned curvature graph.



Conclusion 

In this article, we have extended the Worring and Smeulders [WS93]
classification to new discrete algorithms and have presented a purely discrete
and optimal curvature algorithm based on circle estimation that solves the arc
fitting problem. We have shown that this algorithm gives good results compared
to Vialard’s algorithm, which needs a smoothing parameter. Furthermore, we do
not apply any filtering process to make our results stable. As future work, we
will try to extend our method to 3D discrete surface curvature and thus provide
a useful tool for snow sample analysis or medical imaging [MLD94].

Abstract. This paper describes a method to enhance line structures in
a gray level image. For this purpose, we blur the image using anisotropic
gaussian filters along the direction of each line structure. In a line
structure region the gradients of image gray levels have a uniform
direction. To find such line structures, we evaluate the uniformity of
the directions of the local gradients. Before this evaluation, we need to
smooth out small structures to obtain line directions. We first blur the
given image by a set of gaussian filters. The variance of the gaussian
filter which maximizes the uniformity of the local gradient directions
is detected position by position. Then, the line directions in the image
are obtained from this blurred image. Finally, we blur the image again
using an anisotropic filter along these directions, and enhance every line
structure.

Keywords: Line structure enhancement, multi-resolution analysis, 
anisotropic filter, structure tensor 



1 Introduction 

Generally a figure in an image has local structure and global structure
simultaneously. For example, the figure in the image shown in Fig. 1(a) has
characters locally and character strings globally. By a local operation, the
global line structures of the character strings cannot be caught from this
image. The purpose of the method proposed in this paper is to detect and
emphasize the global line structures in order to recognize the global image
structure. For example, when the figure (a) of Fig. 1 is given, we intend to
obtain the figure (b) before applying OCR technology to read out the string.
For this purpose, we need to shape an in-line distribution of local small
gray-level profiles into a global simple line structure, as shown in (c).

In order to enhance the global line structure, we must devise a method to
disregard local small structures. Blurring them out with a gaussian filter is a
common technique for this purpose [2]. However, as shown in Fig. 2 for the case
where line structures are not isolated in an image, we cannot achieve this by a
simple application of gaussian filters, even if we are careful to choose the
proper size of the filter.

Fig. 1. (a) Image including local structures and global structures, (b) Global structures
in the left image, (c) An in-line distribution of local gray-level profiles (left) is shaped
out to be a global linear profile (right) which has gradients of the same direction.

Fig. 2. Upper left: the original image (151x133 pixels). The interval of character lines
is about 15 pixels. Upper right, lower left, and lower right images are the results of
blurring with gaussian filters with the sizes σ = 1.0, 16.0, and 49.0, respectively.



As shown in this figure, increasing the size of the gaussian filter makes
small local structures disappear one after another. However, a gaussian filter
of too large a size will also smooth out the line structures themselves (see
the lower right figure of Fig. 2): they are blurred out under the influence of
the neighboring structures. Conversely, when the size is chosen so as not to be
influenced by the neighboring line structures, the filter cannot enhance the
line structure of the character sequences (see the lower left figure of Fig. 2).

Thus, line structure cannot necessarily be enhanced by the straightforward
application of the gaussian filters. We must employ the technique of diffusing
the local gray-level only along the direction of the line structure.

Some techniques that apply a gaussian filter having different sizes in
different directions, that is, an anisotropic gaussian filter, have been
proposed. These methods smooth out only in the direction of the line structure.
However, they need knowledge about the direction of line structures
in the image and about the upper bound of the size of local structures
to be smoothed out in advance of the processing. In this paper, we propose a
method to determine the proper parameters of the gaussian filter so as to smooth
out only local small structures and enhance global line structures adaptively
to a given image and to each position in the image. In this method, first, we
blur the given image with gaussian filters of various sizes. Next, we
evaluate the “line-likeness”, which expresses the similarity of the directions of
the local gray-level gradients, position by position in the blurred image. Then
we determine the proper sizes and directions of the anisotropies of the gaussian
filters for each position of the image.

2 Multi-resolution Image Analysis 

For catching the global structure of an image, first, a device which disregards lo- 
cal small structures is needed. The most important consideration is to determine 
how small local structures must be disregarded for the ability to detect global 
structures. Disregarding the small structures can be achieved by employing an 
operation reducing the image resolution. But, we must impose the following two 
conditions to the operation. 

— No prior knowledge on the proper resolution of a given image to catch the 
global structure is needed. 

— When the image resolution is reduced by the operation, no new structure 
which did not exist in the original image will appear. 

The only operation which satisfies these requirements is the diffusion of the
image $f(x,y)$ according to the following differential equation:

$$\partial_t u(x, y, t) = \mathrm{div}(\nabla u(x, y, t)), \qquad u(x, y, 0) = f(x, y) \tag{1}$$

The solution of this diffusion equation agrees with the result of the
blurring of $f(x, y)$ by the gaussian filter with the variance $t$, as

$$u(x, y, t) = f(x, y) * G_t \tag{2}$$

where $G_t$ is the gaussian kernel with the variance $t$ (3), and $*$ means the
convolution.



316 K. Deguchi, T. Izumitani, and H. Hontani 




Fig. 3. (Left) Original image. (Center) and (right) Blurred images by the gaussian 
filters with t = 16 and 200, respectively. 



The resolution of $u(x, y, t)$ becomes gradually lower as $t$ grows. For example,
the left figure in Fig. 3 becomes the center figure by blurring with a gaussian
filter at $t = 16$, then the right figure at $t = 200$. As $t$ grows, the ring
structure, which is the global line structure in the original image, appears and
then disappears. It is necessary to catch the proper moment when this global line
structure appears if we intend to recognize such a global shape of the figure.

In the next section, then, we introduce a criterion of “line-likeness” to find 
the proper value of t. 



3 Evaluation of Line-Likeness 

In the neighborhood of a line structure on an image, the gradients $(f_x, f_y)$
of the image gray-level $f(x,y)$ have the same direction toward the center of the
line, that is, normal to the line structure. When the gradients of gray-levels
have equal direction in a small neighbor region, the image is defined to have a
line structure at the point. Then, we define line-likeness by how similar
the directions of the gradients of $u(x, y, t)$ are in the neighborhood of the
image point.

Three examples of figures having different line-likeness are shown in Fig. 4.
The right-hand one has the most line-like structure.

To show this line-likeness qualitatively, we introduce the gradient space
which is spanned by $f_x$ and $f_y$. The gray-level gradients at each pixel in the
images of Fig. 4 (left), (center) and (right) distribute as shown in Fig. 5 (left),
(center) and (right), respectively. These figures show that, when an image has
line structure, its gray-level gradients distribute linearly in the direction
normal to the image line structure.

The deviation of this distribution and its direction are evaluated by the eigen
vectors and the eigen values of the covariance matrix of the gray-level gradients:

$$J(f(x,y)) = \begin{pmatrix} \iint f_x^2 \, dx\,dy & \iint f_x f_y \, dx\,dy \\ \iint f_x f_y \, dx\,dy & \iint f_y^2 \, dx\,dy \end{pmatrix} \tag{4}$$

Fig. 4. Examples of three figures with different line-likeness.

Fig. 5. Distributions of gradients of gray-levels of the images shown in Fig. 4.



This covariance matrix $J$ represents the total line-likeness of the whole image.
Then, in order to evaluate a local distribution of gray-level gradients, we
introduce a structural-analysis tensor as

$$J_\rho(f(x,y)) = \begin{pmatrix} G(x,y,\rho^2) * f_x^2 & G(x,y,\rho^2) * f_x f_y \\ G(x,y,\rho^2) * f_x f_y & G(x,y,\rho^2) * f_y^2 \end{pmatrix} \tag{5}$$

where $G(x,y,\rho^2)$ is the gaussian function with the variance $\rho^2$.

The eigen vectors of this structural-analysis tensor at a position $(x, y)$ show
the two principal directions of the gradient vectors in the neighborhood of
$(x, y)$, and the eigen values show the deviation of the distribution in those
directions.

The value of $\rho$ determines the area of the neighborhood in which we evaluate
the distribution of the gray-level gradients. When $\rho \to \infty$, the
structural-analysis tensor becomes equal to the covariance matrix of (4).

Then, we evaluate the line-likeness at a position $(x, y)$ by the following
$S(x, y)$ in (6), which is defined using the eigen values $\lambda_1$ and
$\lambda_2$ ($\lambda_1 \geq \lambda_2$) of the structural-analysis tensor.

$$S(x,y) = \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2} \tag{6}$$

The value of $S(x,y)$ spans $[0,1]$. When $S(x,y) \approx 1$, the gray levels
around $(x,y)$ have a line-like structure, and when $S(x,y) \approx 0$, they have
not. When $S(x,y) \approx 1$, the direction of the eigen vector corresponding to
$\lambda_1$ is normal to the direction of the line structure at the point.

Fig. 6. Plots of the value of the line-likeness $S(x, y)$ at the position indicated with
× in the upper left image with respect to the change of the value of $\sigma$.
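
The line-likeness measure can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions rather than the authors' code: gradients are taken with `numpy.gradient`, the gaussian window is a truncated separable kernel with zero padding at the borders (the image is assumed to be larger than the kernel), a small `eps` avoids division by zero in flat regions, and the difference and sum of the eigen values of the 2 x 2 symmetric tensor are obtained in closed form. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel_1d(sigma):
    """Normalized 1D gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable gaussian blur (zero padding; image must be larger than the kernel)."""
    kernel = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")


def line_likeness(image, rho, eps=1e-12):
    """S = (lambda1 - lambda2) / (lambda1 + lambda2) of the tensor J_rho."""
    fy, fx = np.gradient(np.asarray(image, dtype=float))
    jxx = gaussian_blur(fx * fx, rho)
    jxy = gaussian_blur(fx * fy, rho)
    jyy = gaussian_blur(fy * fy, rho)
    eig_sum = jxx + jyy                                   # lambda1 + lambda2
    eig_diff = np.sqrt((jxx - jyy) ** 2 + 4 * jxy ** 2)   # lambda1 - lambda2
    return eig_diff / (eig_sum + eps)
```

On an image of parallel stripes the measure is close to 1, while at the center of a radially symmetric pattern, where gradients point in all directions, it is close to 0.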



4 Multi-scale Evaluation 

For the evaluation of the line-likeness using $S(x,y)$, we must be careful
to choose the value of $\rho$. To determine a proper $\rho$, we employ a series
of the following two-step evaluations of the line-likeness. As the first step, we
blur the original image with a gaussian filter with variance $\sigma^2$. Then, we
evaluate $S(x,y)$ for all pixels. We apply this two-step evaluation repeatedly
while changing the blurring parameter $\sigma^2$. In every evaluation, we set
$\rho = 2\sigma$. With a larger value of $\sigma$, we evaluate more global
line-likeness.

Fig. 6 shows the change of the line-likeness at the position indicated with × in the upper left image with respect to the value of $\sigma$. The upper right image is the image blurred with the value of $\sigma$ at which the line-likeliness $S(x, y)$ becomes maximal. This shows that we can detect the global line structure of the original image by blurring it with the $\sigma$ that maximizes $S(x, y)$. At the same time, the direction of the line structure at every position can also be detected from the direction of the eigenvector of the structural-analysis tensor corresponding to the eigenvalue $\lambda_1$.

Of course, there are cases where $S(x, y)$ has multiple maxima. It should be noted that an image usually contains structures of several sizes arranged hierarchically. In our method, such a multi-level hierarchy of structures can be detected as a set of maxima of the line-likeliness $S(x, y)$. In the next section, we show example images having multi-level structures. They also show that our method is able to detect such structures.
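The scale selection described above amounts to scanning $S(x, y)$ over a list of $\sigma$ values at one pixel and keeping the local maxima. The sketch below assumes the values of $S$ have already been computed for each $\sigma$; the function name `local_maxima` and the sample numbers are hypothetical.

```python
def local_maxima(sigmas, s_values):
    """Return [(sigma, S)] for interior samples that are local maxima of S.

    sigmas and s_values are equal-length sequences sampled at one pixel.
    A plateau is reported once, at its first sample.
    """
    peaks = []
    for i in range(1, len(s_values) - 1):
        if s_values[i - 1] < s_values[i] >= s_values[i + 1]:
            peaks.append((sigmas[i], s_values[i]))
    return peaks

# Hypothetical samples with two structure levels (fine and global)
sigmas = [1, 2, 3, 4, 5, 6, 7]
s_values = [0.1, 0.4, 0.2, 0.3, 0.6, 0.5, 0.2]
peaks = local_maxima(sigmas, s_values)
```

Each returned $\sigma$ would then be used with $\rho = 2\sigma$ to read off the local direction at that level of the hierarchy.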

5 Anisotropic Diffusion to Enhance Line Structure 

We have just shown that global line structures can be detected by evaluating the line-likeliness $S(x, y)$. However, the resulting image was, as shown in the upper right of Fig. 6, a blurred one, and the detected line structure was faded.

Therefore, by using the eigenvalues and the eigenvectors that correspond to the detected line structure, we apply the blurring only along the direction of the line structure. This process smooths out gray-level changes only within the line structure and enhances it with clear contour edges. To blur an image only in a specific direction, the so-called anisotropic diffusion of (7) has been proposed. It diffuses an image according to the dynamical process expressed in (7):

$$\partial_t L(x, y, t) = \operatorname{div}\bigl(D(x, y)\,\nabla L(x, y, t)\bigr), \qquad L(x, y, 0) = f(x, y), \qquad (7)$$

where $D(x, y)$ is a $2 \times 2$ matrix defined at every position in the image, so that its eigenvalues determine the degrees of the diffusion in the directions of its respective eigenvectors. Hereafter, we call $D(x, y)$ the diffusion tensor.

When all the elements of the diffusion tensor $D(x, y)$ belong to $C^\infty$ with respect to $x$ and $y$, and $D(x, y)$ is positive definite, the solution of (7) for each $t$ always exists uniquely, and the solutions $L(x, y, t)$ do not contain new line structures that the original image does not have.

In this paper, we propose to determine a diffusion tensor suitable for enhancing line structures by using the evaluation of the line-likeliness $S(x, y)$. According to the results of the previous sections, such a diffusion tensor is obtained by defining its eigenvalues $\tilde{\lambda}_1$ and $\tilde{\lambda}_2$ and the corresponding eigenvectors $\tilde{v}_1$ and $\tilde{v}_2$ as follows.

Let us denote by $\sigma_0(x, y)$ the value of $\sigma$ at $(x, y)$ for which the line-likeliness $S(x, y)$ becomes maximal. The direction of the line structure at each position is given by the directions of the eigenvectors of the structural-analysis tensor $J_\rho(f(x, y) : \sigma_0(x, y))$. Therefore, letting $v_1(x, y)$ and $v_2(x, y)$ be the eigenvectors corresponding to $\lambda_1$ and $\lambda_2$, respectively, we define

$$\tilde{v}_1 = v_1,\ \tilde{\lambda}_1 \approx 0, \qquad \tilde{v}_2 = v_2,\ \tilde{\lambda}_2 \approx 1. \qquad (8)$$

Figure 7 shows an example of the emphasizing proposed here. The global line structure in the left image was enhanced without blurring its contour edge, as shown in the right image.
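A diffusion tensor with prescribed eigenvectors and eigenvalues can be assembled from the orientation of the line normal. This is a minimal sketch under stated assumptions: the name `diffusion_tensor`, the parameter names, and the default eigenvalues (a small positive value across the line so that the tensor stays positive definite, and 1 along it) are choices made here for illustration, not values given in the text.

```python
import math

def diffusion_tensor(normal_angle, mu_across=1e-3, mu_along=1.0):
    """Return the 2x2 tensor D = mu_across * n n^T + mu_along * t t^T.

    n = (cos a, sin a) is the unit normal to the line structure and
    t = (-sin a, cos a) is the unit vector along it.
    """
    c, s = math.cos(normal_angle), math.sin(normal_angle)
    d_xx = mu_across * c * c + mu_along * s * s
    d_xy = (mu_across - mu_along) * c * s
    d_yy = mu_across * s * s + mu_along * c * c
    return [[d_xx, d_xy], [d_xy, d_yy]]

# Line normal along x: diffusion acts almost only along y
D = diffusion_tensor(0.0)
```

With the normal along the x axis, `D` is diagonal with a near-zero entry for x and 1 for y, so smoothing acts along the line only.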
Fig. 7. (Left) Original image. (Right) Result of emphasizing the global line structure



Fig. 8. Anisotropic blurring of the image of FigEJ upper left) with the selected direc- 
tions and variances for each positions in the image. The line structures of the original 
image were enhanced 



6 Experimental Results 



We have applied the proposed method to many types of images.

Earlier, we showed a document image having rows of character strings. We also showed that the original line structures of the strings could not be enhanced by isotropic and uniform Gaussian filters. By applying the proposed method and selecting proper directions and variances of the filters for each position, we obtained the image of Fig. 8. The line structures were enhanced along the strings using anisotropic and non-uniform blurring with the parameters of the maximal point of $S(x, y)$ for every image position.

The next example, shown in Fig. 9, has multi-level line structures. Every element of the image consists of dots arranged in wavy curved lines. More globally, those elements compose rows of line structures. Figure 10 shows $S(x, y)$ at a point in this image. Almost all points in the image have two maxima like this point. Figure 11 shows the results of the line-structure enhancement using the first and the second maximal points of $S(x, y)$, respectively. The two levels of line structure were enhanced separately.
Fig. 9. An image having multi-level line structures. Every element consists of dots arranged in wavy curved lines. More globally, those elements compose rows of line structures. (216 × 256) pixels

Fig. 10. The line-likeliness $S(x, y)$ at (185, 97) with respect to $\sigma$. $S(x, y)$ has two maximal points in $\sigma$

Fig. 11. (Left) Small line structures were enhanced with the parameters based on the first maximal point of $S(x, y)$. (Right) Global line structures were enhanced based on the second maximal point of $S(x, y)$



The final example shows an application of line structure enhancement. Figure 12 is an image of a fingerprint corrupted with heavy noise. Figure 13 is the result of line structure enhancement by the proposed method using anisotropic and non-uniform blurring. The fingerprint pattern was clearly detected.

7 Conclusions 

The technique of emphasizing the global line structure of a gray-level image was proposed. First, by carrying out a multiple-resolution analysis of the image, we obtained the proper resolution at which global line structures appear for every position in the image. Then, we obtained the direction of the line structures from the image at this resolution and applied the diffusion process to the image, blurring along the line structures. Because the original gray levels are smoothed out only in the direction of the line structure, global line structures are enhanced without obscuring the outline of the line structure.

Fig. 12. A fingerprint image corrupted with noise

Fig. 13. Enhancement of the line structures of Fig. 12 by the proposed method with anisotropic and non-uniform blurring
Abstract. We consider the interactions between edges and intensity
distributions in semi-open image neighborhoods surrounding them. Lo- 
cally this amounts to a kind of figure-ground problem, and we analyze 
the case of smooth figures occluding arbitrary backgrounds. Techniques 
from differential topology permit a classification into what we call folds 
(the side of an edge from a smooth object) and cuts (the background). 
Intuitively, cuts arise when an arbitrary scene is “cut” from view by an 
occluder. The condition takes the form of transversality between an edge 
tangent map and a shading flow field, and examples are included. 



1 Introduction 

On which side of an edge is figure; and on which ground? This classical Gestalt
question is thought to be locally undecidable, and ambiguous globally (Fig.
1(a)). Even perfect line drawing interpretation is combinatorially difficult
(NP-complete for the simple blocks world), and various heuristics, such as closure
or convexity, have been suggested. Nevertheless, an examination of natural
images suggests that the intensity distribution in the neighborhood of edges does 
contain relevant information, and our goal in this paper is to show one basic way 
to exploit it. 

The intuition is provided in Fig. 1(b). From a viewer’s perspective, edges
arise when the tangent plane to the object “folds” out of sight; this naturally 
suggests a type of “figure”, which we show is both natural and commonplace. In 
particular, it enjoys a stable pattern of shading (with respect to the edge). But 
more importantly, the fold side of the edge “cuts” the background scene, which 
implies that the background cannot exhibit this regularity in general. 

Our main contribution in this paper is to develop the difference between folds 
and cuts in a technical sense. We employ the techniques of differential topology 
to capture qualitative aspects of shape, and propose a specific mechanism for 
classifying folds and cuts based on the interaction between edges and the shading 
flow field. The result is further applicable to formalizing an earlier classification
of shadow edges.

Fig. 1. (a) An ambiguous image. The edges lack the information present in (b), a Klein bottle. The shading illustrates the difference between the fold, where the normal varies smoothly to the edge until it is orthogonal to the viewer, and the cut. (c) An image with pronounced folds and cuts.



2 Folds and Cuts 

Figure-ground relationships are determined by the positions of surfaces in the 
image relative to the viewer, so we are specifically interested in edges resulting 
from surface geometry and viewing, which we now consider. 

Consider an image ($I : Z \subset \mathbb{R}^2 \to \mathbb{R}^+$) of a smooth surface $\Sigma : X \subset \mathbb{R}^2 \to Y \subset \mathbb{R}^3$; here $X$ is the surface parameter space and $Y$ is ‘the world’. For a given viewing direction $V \in S^2$ (the unit sphere), the surface is projected onto the image plane by $\Pi_V : Y \to Z \subset \mathbb{R}^2$. For simplicity, we assume that $\Pi$ is orthographic projection, although this particular choice is not crucial to our reasoning. Thus the mapping from the surface domain to the image domain takes $\mathbb{R}^2$ to $\mathbb{R}^2$. See Fig. 2.




Fig. 2. The mappings referred to in the paper, from the parameter of a curve ($U$), to the coordinates of a surface ($X$), to Euclidean space ($Y$), to the image domain ($Z$).



Points in the resulting image are either regular or singular, depending on whether the Jacobian of the surface-to-image mapping, $d(\Pi_V \circ \Sigma)$, is full rank or not. An important result in differential topology is the Whitney Theorem for mappings from $\mathbb{R}^2$ to $\mathbb{R}^2$, which states that such mappings generically have only two types of singularities, folds and cusps. (By generic we mean that the singularities persist under perturbations of the mapping.)

Let $T_x[A]$ denote the tangent space of the manifold $A$ at the point $x$.

Definition 1. The fold is the singularity locus of the surface-to-image mapping, $\Pi_V \circ \Sigma$, where $\Sigma$ is smooth. In the case of orthographic projection the fold is the image of those points on the surface whose tangent plane contains the view direction.

$$\gamma_{fold} = \{ z_p \in Z \mid V \in T_{y_p}[\Sigma(X)],\ y_p = \Sigma(x_p),\ z_p = \Pi_V(y_p) \}$$

We denote the fold generator, i.e. the pre-image of $\gamma_{fold}$ on $\Sigma$, by

$$\Gamma_{fold} = \{ y_p \in Y \mid x_p \in X,\ V \in T_{y_p}[\Sigma(X)],\ y_p = \Sigma(x_p) \}$$

Since the singularities of $\Pi_V \circ \Sigma$ lead to discontinuities if we take $Z$ as the domain, they naturally translate into edges in the image corresponding to the occluding contour and its end points.

Note that due to occlusion and opacity, not all of the singularities present in a given image mapping will give rise to edges in the image. For example, the edge in an image corresponding to a fold actually corresponds to two curves on the surface: the fold generator and another curve, the locus of points occluded by the fold. We call this the contour shadow,

$$\Gamma_{\Gamma\text{-}shadow} = \{ y_p \in Y \mid \exists t \in \mathbb{R},\ y_p = y_q + tV,\ y_q \in \Gamma_{fold} \}$$

Now suppose $\Sigma$ is piecewise smooth. We now have two additional sources of discontinuity in the image mapping: points where the surface itself is discontinuous,

$$\Gamma_{boundary} = \{ y_p \in Y \mid \exists \delta \in S^1,\ \lim_{\epsilon \to 0} \Sigma(x_p + \epsilon\delta) \neq \Sigma(x_p),\ y_p = \Sigma(x_p) \}$$

and points where the surface normal is discontinuous,

$$\Gamma_{crease} = \{ y_p \in Y \mid \exists \delta \in S^1,\ \lim_{\epsilon \to 0} N(x_p + \epsilon\delta) \neq N(x_p),\ y_p = \Sigma(x_p) \}$$

Fig. 3 summarizes the points we’ve defined.

Definition 2. The CUT is the set of points in the image where the image is discontinuous due to occlusion, surface discontinuities, or surface normal discontinuities.

$$\gamma_{cut} = \{ z_p \in Z \mid z_p \in \Pi_V(\Gamma_{\Gamma\text{-}shadow} \cup \Gamma_{boundary} \cup \Gamma_{crease}) \}$$

Note that $\gamma_{fold} \subset \gamma_{cut}$, while their respective pre-images are disjoint, except at special points such as T-junctions.



Fig. 3. Categories of points of a mapping from $\mathbb{R}^2$ to $\mathbb{R}^2$: (1) a regular point, (2) a fold point, (3) a cusp, (4) a $\Gamma$-shadow point, (5) a crease point, (6) a boundary point. The viewpoint is taken to be at the upper left. From this position the fold (solid line) and the fold shadow (dashed line) appear aligned.



If a surface has a pattern on it, such as shading, the geometry of folds gives rise to a distinct pattern in the image. Identifying the fold structure is naturally useful as a prerequisite for geometrical analysis. It is the contrast of this structure with that of cuts which is intriguing in the context of figure-ground. Our contribution develops this as a basis for distinguishing between $\gamma_{fold}$ and $\gamma_{cut}$.

2.1 Curves and Flows at Folds and Cuts

Consider a surface viewed such that its image has a fold, with a curve on the surface which runs through the fold. In general, the curve in the image osculates the fold (Fig. 4).




Fig. 4. A curve, $\Sigma \circ \alpha$, passing through a point on the fold generator, $\Gamma_{fold}$. (a) The tangent to the curve, $T[\Sigma \circ \alpha]$, lies in the tangent plane to the surface, $T[\Sigma]$, as does the tangent to the fold generator, $T[\Gamma_{fold}]$. (b) In the image, the tangent plane to the surface at the fold projects to a line, and so the curve is tangent to the fold.



Let $\alpha$ be a smooth curve on $\Sigma$; $\alpha : U \subset \mathbb{R} \to X$. If $\alpha$ passes through point $y_p = \Sigma \circ \alpha(u_p)$ on the surface then $T_{y_p}[\Sigma \circ \alpha(U)] \subset T_{y_p}[\Sigma(X)]$. An immediate consequence of this for images is that, if we choose $V$ such that $z_p = \Pi_V(y_p) \in \gamma_{fold}$, then the image of $\alpha$ is tangent to the fold, i.e. $T_{z_p}[\Pi_V \circ \Sigma \circ \alpha(U)] = T_{z_p}[\gamma_{fold}]$.

There is one specific choice of $V$ for which this does not hold: $V \in T_{y_p}[\Sigma \circ \alpha(U)]$. At such a point $\Pi_V \circ \Sigma \circ \alpha(U)$ has a cusp and is transverse (non-tangent) to $\gamma_{fold}$.

Intuitively, it seems that the image of $\alpha$ should be tangent to $\gamma_{fold}$ “most of the time”. Situations in which the image of $\alpha$ is not tangent to $\gamma_{fold}$ result from the “accidental” alignment of the viewer with the curve. The notion of “generic viewpoint” is often used in computer vision to discount such accidents. We use the analogous concept of general position, or transversality, from differential topology, to distinguish between typical and atypical situations.

Definition 3. Let $M$ be a manifold. Two submanifolds $A, B \subset M$ are IN GENERAL POSITION, or TRANSVERSAL, if $\forall p \in A \cap B$, $T_p[A] + T_p[B] = T_p[M]$.

We call a situation typical if its configuration is transversal, atypical (accidental) otherwise. See Fig. 5.

We show that if we view an arbitrary smooth curve, on an arbitrary smooth 
surface, from an arbitrary viewpoint, then typically at the point where the curve 
crosses the fold in the image, the curve is tangent to the fold. We do so by showing 
that in the space of variations, the set of configurations for which this holds is 
transversal, while the non-tangent configurations are not transversal. 

Fig. 5. Transversality. (a) $A$ and $B$ do not intersect, thus they are transversal. (b) $A$ and $B$ intersect transversally. A small motion of either curve leaves the intersection intact. (c) A non-transverse intersection: a small motion of either curve transforms (c) into (a) or (b).

For the image of $\alpha$ to appear transverse to the fold, we need $T_{y_p}[\Sigma \circ \alpha(U)] = V$ at some point $y_p \in \Gamma_{fold}$. $T[\Sigma \circ \alpha(U)]$ traces a curve in $S^2$, possibly with self intersections. $V$ however is a single point in $S^2$. At $T[\Sigma \circ \alpha(U)] = V$ we note that $T_V[T[\Sigma \circ \alpha(U)]] \cup T_V[V] = T_V[T[\Sigma \circ \alpha(U)]] \cup \emptyset \neq T_V[S^2]$, thus this situation is not transversal. If $T[\Sigma \circ \alpha(U)] \neq V$ then $T[\Sigma \circ \alpha(U)] \cap V = \emptyset$. See Fig. 6. This is our first result:

Result 1 If, in an image of a surface with a curve lying on the surface, the 
curve on the surface crosses the fold generator, then the curve in the image will 
typically appear tangent to the fold at the corresponding point in the image. 

Examples of single curves on surfaces where this result can be exploited include occluding contours and shadows.

Fig. 6. The tangent field of $\alpha$, $C = T[\Sigma \circ \alpha(U)]$, traces a curve in $S^2$. When $V$ intersects $C$, the curve $\alpha$ is transverse to the fold in the image. This situation ($V_1$) is not transversal, and thus only occurs accidentally. In the typical situation ($V_2$), $\alpha$ is tangent to the fold when it crosses.



For a family of curves on a surface, the situation is similar: along a fold, the 
curves are typically tangent to the fold. However, along the fold the tangents to 
the curves vary, and may at some point coincide with the view direction. The 
typical situation is that the curves are tangent to the fold, except at isolated 
points on the fold, where they are transverse. 

Let $A : (U, V) \subset \mathbb{R}^2 \to X$ define a family of curves on a surface. As before, a curve appears transverse to the fold if its tangent is the same as the view direction: $T_{y_p}[\Sigma \circ A(U, V)] = V$, and $V$ is a point in $S^2$. Now $T_U[\Sigma \circ A(U, V)]$ is a surface in $S^2$. The singularities of such a field are generically folds and cusps (again applying the Whitney Theorem), and so $V$ does not intersect the singular points transversally. However, $V$ will intersect the regular portion of $T_U[\Sigma \circ A(U, V)]$, and such an intersection is transversal: $T_V[T_U[\Sigma \circ A(U, V)]] = T_V[S^2]$. The dimensionality of this intersection is zero, and so non-tangency occurs at isolated points along $\gamma_{fold}$. The number of such points depends on the singular structure of the vector field. This gives us:

Result 2 In an image of a surface with a family of smooth curves on the surface, the curves crossing the fold generator typically are everywhere tangent to the fold in the image, except at isolated points.

Similar arguments can be made for more general projective mappings. Dufour has classified the possible diffeomorphic forms that families of curves under mappings from $\mathbb{R}^2$ to $\mathbb{R}^2$ can take; one feature of this classification is that the tangency condition just described is satisfied by those forms describing folds.

For a discontinuity in the image not due to a fold, the situation is reversed: for a curve to be tangent to the edge locus, it must have the exact same tangent as the edge (Fig. 7).




Fig. 7. The appearance of a curve intersecting a cut. (a) At a cut, the tangent plane to the surface does not contain the view direction. As a result there is no degeneracy in the projection, and so the curve will appear transverse to the cut in the image (b).



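As before, we consider the behaviour of a curve $\alpha$ on a surface, now in the vicinity of a cut, $\gamma_{cut}$. For $\Pi_V \circ \Sigma \circ \alpha$ to be tangent to $\gamma_{cut}$, we need $T_{z_p}[\Pi_V \circ \Sigma \circ \alpha(U)] = T_{z_p}[\gamma_{cut}]$, which only occurs when $T_{y_p}[\Sigma \circ \alpha(U)] = T_{y_p}[\Gamma_{cut}]$, or equivalently $T_{x_p}[\alpha(U)] = T_{x_p}[\Sigma^{-1} \circ \Gamma_{cut}]$. Consider the space $\mathbb{R}^2 \times S^1$. $\alpha \times T[\alpha]$ traces a curve in this space, as does $\Sigma^{-1} \circ \Gamma_{cut} \times T[\Sigma^{-1} \circ \Gamma_{cut}]$. We would not expect these two curves to intersect transversally in this space, and indeed: at any $p \in \alpha \times T[\alpha] \cap \Sigma^{-1} \circ \Gamma_{cut} \times T[\Sigma^{-1} \circ \Gamma_{cut}]$, the tangent spaces of the two curves do not span $T_p[\mathbb{R}^2 \times S^1]$.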

Result 3 If, in an image of a surface with a curve lying on the surface, the 
curve on the surface crosses the cut generator, then the curve in the image will 
typically appear transverse to the cut at the corresponding point in the image. 

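We now derive the analogous result for a family of curves $A$. For $\Pi_V \circ \Sigma \circ A(U, V)$ to be tangent to $\gamma_{cut}$, we need $T_{z_p}[\Pi_V \circ \Sigma \circ A(U, V)] = T_{z_p}[\gamma_{cut}]$, which only occurs when $T_{y_p}[\Sigma \circ A(U, V)] = T_{y_p}[\Gamma_{cut}]$. In $\mathbb{R}^2 \times S^1$, $A \times T[A]$ is a surface, and $\Sigma^{-1} \circ \Gamma_{cut} \times T[\Sigma^{-1} \circ \Gamma_{cut}]$ is a curve. The intersection of these two objects is transverse: at $p \in A \times T[A] \cap \Sigma^{-1} \circ \Gamma_{cut} \times T[\Sigma^{-1} \circ \Gamma_{cut}]$, their tangent spaces together span $T_p[\mathbb{R}^2 \times S^1]$. See Fig. 8.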

Result 4 In an image of a surface with a family of smooth curves on the surface, 
the curves crossing the cut generator typically are everywhere transverse to the 
cut in the image, except at isolated points. 
Fig. 8. $A \times T[A(U, V)]$ traces a surface in $\mathbb{R}^2 \times S^1$ while, letting $C = \Sigma^{-1} \circ \Gamma_{cut}$, $C \times T[C]$ traces a curve. When the two intersect, the curves of $A$ are tangent to the cut in the image. This situation is transversal, but has dimension zero.



Thus, in an image of a surface with a family of curves on the surface, there are 
two situations: (fold) the curves are typically tangent to the fold, with isolated 
exceptional points or (cut) the curves are typically transverse to the cut, with 
isolated exceptional points. 
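The dimension counts behind Results 1 to 4 can be collected in one place. The summary below is a sketch that relies on the standard fact that two submanifolds $A$ and $B$ of $M$ in general position meet in a set of dimension $\dim A + \dim B - \dim M$, and do not meet at all when this number is negative; the side-by-side layout is added here and is not part of the original derivation.

```latex
\begin{align*}
\text{Result 1 (curve at a fold):}\quad
  & \dim T[\Sigma\circ\alpha] + \dim V - \dim S^2 = 1 + 0 - 2 = -1
  && \Rightarrow\ \text{non-tangency is accidental} \\
\text{Result 2 (family at a fold):}\quad
  & \dim T_U[\Sigma\circ A] + \dim V - \dim S^2 = 2 + 0 - 2 = 0
  && \Rightarrow\ \text{non-tangency at isolated points} \\
\text{Result 3 (curve at a cut):}\quad
  & \dim(\alpha \times T[\alpha]) + \dim(C \times T[C]) - \dim(\mathbb{R}^2 \times S^1) = 1 + 1 - 3 = -1
  && \Rightarrow\ \text{tangency is accidental} \\
\text{Result 4 (family at a cut):}\quad
  & \dim(A \times T[A]) + \dim(C \times T[C]) - \dim(\mathbb{R}^2 \times S^1) = 2 + 1 - 3 = 0
  && \Rightarrow\ \text{tangency at isolated points}
\end{align*}
```

Here $C = \Sigma^{-1} \circ \Gamma_{cut}$, as in the caption of Fig. 8.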

3 The Shading Flow Field at an Edge 

Now consider a surface $\Sigma$ under illumination from a point source at infinity in the direction $L$. If the surface is Lambertian then the shading at a point $p$ is $s(p) = N \cdot L$, where $N$ is the normal to the surface at $p$; this is the standard model assumed by most shape-from-shading algorithms. We define the shading flow field to be the unit vector field tangent to the level sets of the shading field: $S = \frac{1}{\|\nabla s\|}\left(-\frac{\partial s}{\partial y}, \frac{\partial s}{\partial x}\right)$. The structure of the shading flow field can be used to distinguish between several types of edges, e.g. cast shadows and albedo changes. Applying the results of the previous section, the shading flow field can be used to categorize edge neighborhoods as fold or cut.

Since $\Sigma$ is smooth (except possibly at $\Gamma_{cut}$), $N$ varies smoothly, and as a result so does $s$. Thus $S$ is the tangent field to a family of smooth curves. Consider $S$ at an edge point $p$. If $p$ is a fold point, then in the image $S(p) = T_p[\gamma_{fold}]$. If $p$ is a cut point, then $S(p) \neq T_p[\gamma_{cut}]$. (Fig. 9)

Proposition 1. At an edge point $p \in \gamma$ in an image we can define two semi-open neighborhoods, $N_p^A$ and $N_p^B$, where the surface-to-image mapping is continuous in each neighborhood. We can then classify $p$ as follows:

1. FOLD-FOLD: The shading flow is tangent to $\gamma$ at $p$ in $N_p^A$ and in $N_p^B$, with exception at isolated points.

2. FOLD-CUT: The shading flow is tangent to $\gamma$ at $p$ in $N_p^A$ and the shading flow is transverse to $\gamma$ at $p$ in $N_p^B$, with exception at isolated points.

3. CUT-CUT: The shading flow is transverse to $\gamma$ at $p$ in $N_p^A$ and in $N_p^B$, with exception at isolated points.

Figs. E3 El and m illustrate the applicability of our categorization. 
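The classification of Proposition 1 can be sketched as a comparison between the shading flow orientation on each side of an edge and the edge tangent. This is an illustration under assumptions, not the authors' procedure: the function names, the inputs (shading gradients measured on each side), and the angular tolerance of 15 degrees are all hypothetical.

```python
import math

def shading_flow(sx, sy):
    """Unit vector tangent to the level sets of s, given its gradient (sx, sy)."""
    norm = math.hypot(sx, sy)
    if norm == 0.0:
        raise ValueError("shading gradient vanishes; flow is undefined")
    return (-sy / norm, sx / norm)

def flow_edge_angle(flow, edge_tangent):
    """Unsigned angle in [0, pi/2] between the flow and edge orientations."""
    tx, ty = edge_tangent
    dot = abs(flow[0] * tx + flow[1] * ty) / math.hypot(tx, ty)
    return math.acos(min(1.0, dot))

def classify_side(flow, edge_tangent, tol=math.radians(15)):
    """Label one side: 'fold' if the flow is nearly tangent to the edge, else 'cut'."""
    return "fold" if flow_edge_angle(flow, edge_tangent) <= tol else "cut"

def classify_edge(flow_a, flow_b, edge_tangent, tol=math.radians(15)):
    """Return 'fold-fold', 'fold-cut' or 'cut-cut' for the two sides of an edge."""
    labels = [classify_side(f, edge_tangent, tol) for f in (flow_a, flow_b)]
    return "-".join(sorted(labels, reverse=True))

# Vertical edge; shading varies across the edge on side A and along it on side B
edge = (0.0, 1.0)
side_a = shading_flow(1.0, 0.0)   # flow parallel to the edge
side_b = shading_flow(0.0, 1.0)   # flow perpendicular to the edge
label = classify_edge(side_a, side_b, edge)
```

A fixed tolerance ignores the isolated exceptional points allowed by the proposition, so in practice the decision would be pooled over a stretch of the edge rather than taken at a single point.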



3.1 Computing Folds and Cuts 

Fig. 9. The shading flow field at an edge. Near a fold (a) the shading flow field becomes tangent to the edge in the image (b). At a cut (c), the flow is transverse (d).

To apply our categorization requires the computation of the shading flow in
the neighborhood of edges. This presents a problem, as both edge detection and
shading flow computation typically involve smoothing, and at discontinuities the
shading flow will be inaccurate.¹ The effects of smoothing across an edge must
either be minimized (e.g., by adaptively smoothing based on edge detection),
or otherwise accounted for (e.g., relaxation labeling). For simplicity only
uniform smoothing is applied; note that this immediately places an upper limit
on the curvature of folds we can discern (consider how occluding polyhedral edges
may appear as folds at high resolution). This raises the question as to what
the appropriate measurement should be in making the categorization. In the
examples shown, the folds are of greater extent than the applied smoothing filter
(see Fig. m), and so our classification is applicable by observing the shading 
flow orientation as compared to edge orientation (a filter based on orientation
averaging suffices); in cases where high surface curvature is present, higher order 
calculations may be appropriate (i.e. shading flow curvature). 



4 Conclusions 

The categorizations we have presented are computable locally, and are intimately related to figure-ground discrimination (see Fig. 12). Furthermore, the advantage of introducing the differential topological analysis for this problem is that it is readily generalized to more realistic, or even arbitrary, shading distributions. For example, shading that results from diffuse lighting can be expressed in terms of an aperture function that smoothly varies over the surface, meeting the conditions we described in Section 2, thus enabling us to make the fold-cut distinction. The same analysis can be applied to texture or range data.

¹ This also affects the gradient magnitude, which can be used to aid our classification.







Fig. 11. (a) A scene with folds and cuts (a close-up of Michelangelo’s Pietà). (b) The shading flow field in the vicinity of an occlusion edge (where the finger obscures the shroud). Observe how the shading flow is clearly parallel to the edge on the fold side of the edge. (c) A fold-fold configuration of the shading flow at a crease in the finger.

Fig. 12. Shaded versions of the ambiguous figure from Fig. 1 (after Kanizsa). In (a) the shading is transverse to the edges. In (b) the shading is approximately tangent to the edges. Notice how flat the convex blobs in (a) look compared to those in (b). In (c) both types of shading are present.
Abstract. In this paper, we present an incremental algorithm to efficiently extract topological features of deformable objects from given sequential volume data. Even though these features are global shape information, we show that it is possible to extract them by local operations, avoiding global operations.



1 Introduction 

In this paper, we consider the case where a sequence of 3D digital images is given as volume data of objects. From such sequential volume data, we present an algorithm to efficiently extract the shape information of the objects of interest. Shape information is mainly divided into geometric and topological features, and we focus on the latter, such as the number of objects and the numbers of cavities and tunnels of each object. Topological features are global shape information, and some algorithms for calculating them from the entire volume data at a given moment have already been presented. In this paper, since we deal with a sequence of volume data, we use the topological features already calculated at the previous time slot to obtain those at the current time slot. In other words, we obtain their difference equations with respect to time. If the shape difference caused by deformation is small, we expect that the global calculation of topological features can be replaced by a local calculation at the part where the small deformation occurs.

For such topological feature extraction, we need a deformable object model and its deformation model. Several models that take topological changes into consideration have already been studied for segmentation of 3D digital images [4,5]. Those models use geometric constraints and image density constraints to change the topology of object surfaces. Therefore, it is possible to change the topologies of objects by using their geometric and image density information, though it is impossible to change the geometries of objects by using their topological information. Since one of our final goals is 3D shape analysis by integration of geometric and topological information, we look for models such that topological features can be treated more explicitly. In thinning algorithms, topology-based deformable object and deformation models are actually considered. However, their deformation reduces the dimension of the objects of interest, for example from three to one, while our deformation preserves the dimension of objects, since we assume that any 3D object remains a 3D object throughout a sequence of volume data.

* The current address of the first author is Laboratoire A2SI, ESIEE, Cité Descartes, B.P. 99, 93162 Noisy-le-Grand Cedex, France, thanks to JSPS Postdoctoral Fellowships for Research Abroad from 2000.

C. Arcelli et al. (Eds.): IWVF4, LNCS 2059, pp. 333 ff., 2001.
© Springer-Verlag Berlin Heidelberg 2001

In order to consider the topologies of objects including their dimensions, we take the approach of combinatorial topology for our deformable object model and its deformation model. We first introduce a polyhedral representation based on discrete combinatorial topology and show that this representation enables us to describe the deformation of 3D objects by set operations on polyhedra. We then present our difference algorithms for calculating topological features from sequential volume data by using set operations.



2 Polyhedral Representation of 3D Discrete Objects 

Setting $Z$ to be the set of integers, $Z^3$ becomes the set of points whose coordinates are all integers in the three-dimensional Euclidean space $R^3$. Those points are called lattice points, and a set of volume data is given as a finite subset of $Z^3$. Our polyhedral representation is based on the approach of discrete combinatorial topology. We first define primitives of polyhedra in $Z^3$ and then define general polyhedra as combinations of those primitives. To combine the primitives, we define set operations, since our polyhedron is described by a set of polygons.

2.1 Primitives of Discrete Polyhedra 

This subsection is devoted to defining sets of primitives of $n$-dimensional polyhedra whose vertices are all lattice points. They are called $n$-dimensional unit discrete polyhedra. Setting a unit cubic region

$$D(i, j, k) = \{ (i + \epsilon_1, j + \epsilon_2, k + \epsilon_3) : \epsilon_l = 0 \text{ or } 1,\ l = 1, 2, 3 \},$$

we consider that each point in $D(i, j, k)$ has a value of either one or zero. Such a point whose value is one or zero is called a 1- or 0-point, respectively. For every arrangement of 1- and 0-points in $D(i, j, k)$, a convex hull of all 1-points is obtained. The dimension of a convex hull depends on the number $p$ and the arrangement of 1-points in $D(i, j, k)$. For instance, a convex hull becomes a 0-dimensional isolated point when $p = 1$, a 1-dimensional line segment when $p = 2$, and a 2-dimensional triangle when $p = 3$. When $p = 4$, it becomes a 2-dimensional rectangle if all 1-points lie on a plane, and otherwise a 3-dimensional tetrahedron. When $p \geq 5$, it becomes a 3-dimensional polyhedron with $p$ vertices.
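The dimension of the convex hull of the 1-points equals the rank of the vectors from one 1-point to the others. The following sketch computes that rank with exact arithmetic; the function name `hull_dimension` and the tuple representation of lattice points are assumptions made for illustration.

```python
from fractions import Fraction

def hull_dimension(points):
    """Dimension (0..3) of the convex hull of a non-empty list of 3D lattice points."""
    base = points[0]
    # Difference vectors from the first point; their rank is the affine dimension
    rows = [[Fraction(a - b) for a, b in zip(p, base)] for p in points[1:]]
    rank = 0
    for col in range(3):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]       # p = 4, coplanar
tetrahedron = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]  # p = 4, not coplanar
```

The two examples with $p = 4$ give dimensions 2 and 3 respectively, matching the rectangle and tetrahedron cases described above.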

Definition 1. If the convex hull of $p$ 1-points $x_1, x_2, \ldots, x_p$ in $D(i, j, k)$ has $n$ dimensions where $n = 0, 1, 2, 3$, then $\{x_1, x_2, \ldots, x_p\}$ is called an $n$-dimensional unit discrete polyhedron.

In $Z^3$, a set of the points neighboring a point $x$ is defined as

$$N_m(x) = \{ y \in Z^3 : \|x - y\| \leq \sqrt{t} \}$$

where $t = 1, 2, 3$ for $m = 6, 18, 26$. Those neighborhood systems are called the 6-, 18-, and 26-neighborhood systems, respectively. If any pair of adjacent vertices which are connected via an edge of a unit discrete polyhedron are $m$-neighboring, such a unit discrete polyhedron is classified into the set of unit discrete polyhedra for the $m$-neighborhood system. In this paper, we focus on the dimensions $n = 2$ and 3. Hereafter, we refer to 2- and 3-dimensional unit discrete polyhedra as unit discrete polygons and polyhedra, respectively. Table 1 shows all unit discrete polygons and polyhedra defined for each neighborhood system.

Table 1. All 2- and 3-dimensional unit discrete polyhedra for the 6-, 18- and 26-neighborhood systems. We omit unit discrete polyhedra which differ from those in the table by rotations.

A unit discrete polygon consisting of $p$ points $x_1, x_2, \ldots, x_p$ is denoted by

$$S = \{x_1, x_2, \ldots, x_p\}.$$
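The $m$-neighborhood $N_m(x)$ defined above can be enumerated directly from the squared distance bound $t$. In this sketch the function name `neighbors` is an assumption, and the point $x$ itself (distance zero) is left out of the returned list so that the list sizes match the names of the neighborhood systems.

```python
def neighbors(x, m):
    """Lattice points y != x with ||x - y||^2 <= t, where t = 1, 2, 3 for m = 6, 18, 26."""
    t = {6: 1, 18: 2, 26: 3}[m]
    steps = (-1, 0, 1)  # larger steps would already exceed the bound t <= 3
    return [
        (x[0] + dx, x[1] + dy, x[2] + dz)
        for dx in steps for dy in steps for dz in steps
        if 0 < dx * dx + dy * dy + dz * dz <= t
    ]

counts = [len(neighbors((0, 0, 0), m)) for m in (6, 18, 26)]
```

Comparing squared distances with $t$ avoids taking square roots and keeps the test exact on integer coordinates.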

Because a unit discrete polyhedron is bounded by a set of unit discrete polygons, it is denoted by

$$P = \{S_1, S_2, \ldots, S_q\}. \qquad (1)$$

Each $S_i = \{x_{i1}, x_{i2}, \ldots, x_{ip}\}$ for $i = 1, 2, \ldots, q$ is oriented such that the sequence $x_{i1}, x_{i2}, \ldots, x_{ip}$ has a counterclockwise order from a viewpoint exterior to $P$.

2.2 Recursive Definition of Discrete Polyhedra and Set Operations 

Let $\mathcal{P}$ be the family of sets of oriented unit discrete polygons. For any pair of
finite sets A and B in $\mathcal{P}$, we consider

$X_A(B) = \{ S \in A : S = T^{-1} \text{ for any } T \in B \}$ ,

where the notation $S^{-1}$ represents a unit discrete polygon whose orientation is
opposite to that of S, and the relation S = T means that S and T are equivalent
oriented unit discrete polygons. Using $X_A(B) \subseteq A$ and $X_B(A) \subseteq B$, we define
the addition operation with the notations $\cup$ and $\setminus$, which are the set union and
set difference respectively, such that

$A + B = (A \setminus X_A(B)) \cup (B \setminus X_B(A))$ . (2)

Fig. 1. An example of addition between two unit discrete polyhedra P and Q for the
26-neighborhood system, where $P + Q = (P \cup Q) \setminus \{S, T\}$.

Since the empty set is regarded as the neutral element for any $A \in \mathcal{P}$ such that

$A + \emptyset = \emptyset + A = A$ ,

the inverse element for any $A \in \mathcal{P}$ is defined as

$-A = \{ S^{-1} : S \in A \}$

so that

$A + (-A) = (-A) + A = \emptyset$ .

We then define the subtraction for A and B in $\mathcal{P}$ such that

$A - B = A + (-B) = (A \setminus B) \cup (-(B \setminus A))$ .

Let us consider the addition of two unit discrete polyhedra P and Q.
Figure 1 illustrates an example of P + Q in the 26-neighborhood system. After
combining P and Q at their common faces $S \in X_P(Q)$ and $T \in X_Q(P)$,
we exclude all S and T from the union of P and Q and obtain P + Q. If we set
P' = P + Q in Figure 1, then we also see P' − Q = P. By using the addition
operation given in (2), we define the set of discrete polyhedra from the finite set
of unit discrete polyhedra for each m-neighborhood system.
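
The set operations of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an oriented polygon is modeled as a tuple of vertex labels read in its orientation order, two tuples are treated as the same oriented polygon when they differ by a cyclic rotation, and the function names (`canon`, `inverse`, `X`, `add`, `neg`, `sub`) as well as the integer vertex labels are hypothetical, not taken from the paper.

```python
def canon(poly):
    """Rotate a vertex cycle so its smallest vertex comes first (orientation is kept)."""
    k = poly.index(min(poly))
    return tuple(poly[k:] + poly[:k])


def inverse(poly):
    """S^{-1}: the same polygon traversed in the opposite direction."""
    return canon(tuple(reversed(poly)))


def X(A, B):
    """X_A(B): polygons of A whose opposite-oriented copy occurs in B."""
    opposite = {inverse(T) for T in B}
    return {S for S in A if S in opposite}


def add(A, B):
    """Equation (2): remove the matched pairs, keep everything else."""
    return (A - X(A, B)) | (B - X(B, A))


def neg(A):
    """-A: every polygon of A with reversed orientation."""
    return {inverse(S) for S in A}


def sub(A, B):
    """A - B defined as A + (-B)."""
    return add(A, neg(B))


# Two tetrahedra glued along the triangle on vertices 1, 2, 3 (labels are illustrative).
P = {canon(f) for f in [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]}
Q = {canon(f) for f in [(1, 3, 2), (1, 2, 4), (2, 3, 4), (3, 1, 4)]}

PQ = add(P, Q)
print(len(PQ))          # number of boundary polygons after gluing
print(sub(PQ, Q) == P)  # subtraction undoes the addition
print(add(P, neg(P)))   # a set plus its inverse is empty
```

Storing each polygon in a canonical rotation lets ordinary Python sets play the role of the family of polygon sets, so the union and difference in (2) map directly onto `|` and `-`.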

Definition 2. Discrete polyhedra $P_m$ are recursively constructed for each m =
6, 18, 26 as follows:

1. a unit discrete polyhedron for an m-neighborhood system is considered to be
a discrete polyhedron $P_m$, and we set

$W(P_m) = \{ D(x) : \bigcup_{S \in P_m} S \subseteq D(x) \}$ ;

2. if $P_m$ and $A_m$ are, respectively, a discrete polyhedron and a unit discrete
polyhedron such that they satisfy the following conditions:

a) $W(P_m) \cap W(A_m) = \emptyset$;
b) if there exists a pair of $S \in P_m$ and $T \in A_m$ in $D(x) \cap D(y)$ so that
$D(x) \in W(P_m)$ and $D(y) \in W(A_m)$, then $S = T^{-1}$,
then $P'_m = P_m + A_m$ becomes a discrete polyhedron, and we set

$W(P'_m) = W(P_m) \cup W(A_m)$ .

From Definition 2 we see that a discrete polyhedron $P_m$ for m = 6, 18, 26 is
constructed by combining the unit discrete polyhedra $A_m$ in all $D(x) \in W(P_m)$
one by one.

2.3 Discrete Polyhedron Construction from Volume Data 

Let V be a set of volume data which is a finite subset of $Z^3$. Setting all points
in V to be 1-points, $Z^3 \setminus V$ becomes the set of all 0-points in $Z^3$. Considering a
discrete polyhedron $P_m(V)$ constructed from any V, we obtain

$P_m(V) = \mathop{+}\limits_{D(x) \in W} P_m(V \cap D(x))$ (3)

where

$W = \{ D(x) : D(x) \cap V \ne \emptyset, \; x \in Z^3 \}$ .

Each $P_m(V \cap D(x))$ for $D(x) \in W$ represents a 3-dimensional unit discrete
polyhedron which is constructed with respect to the 1-points of V in D(x) as shown
in Table 1. The following theorem is proved in [?].

Theorem 1. For any given V, we can uniquely construct $P_m(V)$ for each m =
6, 18, 26 by (3).

In [?], an efficient algorithm which directly constructs $P_m(V)$ from V is
also presented. From Theorem 1 we have the proof of unique construction of
$P_m(V)$ for each V in a sequence. In other words, we obtain a unique sequence
of polyhedra $P_m(V)$ corresponding to a sequence of volume data V. In the following
sections, we omit m of $P_m(V)$ because the following discussion is common for any m.
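
As a rough illustration of how the index set W in (3) can be gathered from volume data, the following Python sketch collects, for every unit cube that touches V, the 1-points it contains. It assumes that D(x) denotes the unit cube with the eight vertices x + {0,1}^3; the function names are hypothetical. The step from the 1-points of a cube to its unit discrete polyhedron (the lookup in Table 1) is not reproduced here.

```python
from itertools import product

OFFSETS = list(product((0, 1), repeat=3))


def cube_vertices(x):
    """Vertices of the unit cube D(x), assumed to be x + {0,1}^3."""
    return {(x[0] + a, x[1] + b, x[2] + c) for a, b, c in OFFSETS}


def cubes_containing(p):
    """Base points y of the eight unit cubes D(y) that have p as a vertex."""
    return {(p[0] - a, p[1] - b, p[2] - c) for a, b, c in OFFSETS}


def occupied_cubes(V):
    """W as a dict: base point x of D(x) -> the 1-points of V lying in D(x)."""
    W = {}
    for p in V:
        for x in cubes_containing(p):
            W.setdefault(x, set()).add(p)
    return W


V = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)}
W = occupied_cubes(V)
print(len(W), sorted(W[(0, 0, 0)]))
```

Iterating over the points of V rather than over all cubes of a bounding box keeps the work proportional to the number of 1-points.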

3 Deformation of Polyhedral Objects 

3.1 Deformation Description by Set Operations 

Deformation of a discrete polyhedron P is mainly classified into two types of
simple deformation. If $P_t$ and $P_{t+1}$ are discrete polyhedra before and after
deformation, respectively, then the two types of deformation from $P_t$ to $P_{t+1}$ are
described using the addition and subtraction operations such that

$P_{t+1} = P_t + \Delta P$ , (4)

$P_{t+1} = P_t - \Delta P$ , (5)

where $\Delta P$ is a discrete polyhedron which is the difference between $P_t$ and $P_{t+1}$.
Equations (4) and (5) are called expanding and shrinking, respectively.
3.2 Polyhedral Deformation from Sequential Volume Data

Any deformation of the discrete polyhedra $P(V_t)$ constructed from a sequence of
volume data $V_t$ is described by a combination of the expanding and shrinking
operations (4) and (5). First, we consider two kinds of such combinatorial deformation
for the cases of adding a point x to $V_t$ and removing x from $V_t$, respectively.

For the deformation of adding (resp. removing) x to (resp. from) $V_t$,
we need to consider only the region given by the union of

$E_x = \{ D(y) : x \in D(y) \}$

which consists of the eight unit cubes D(y) around x, because such a deformation
affects only the polyhedral shape within the union of $E_x$. By adding $x \in Z^3 \setminus V_t$
to $V_t$, $P(V_t)$ is expanded to $P(V_t \cup \{x\})$, which is also uniquely determined by
(3), such that

$P(V_t \cup \{x\}) = P(V_t) - \Delta P_1 + \Delta P_2$ (6)

where

$\Delta P_1 = \mathop{+}\limits_{D(y) \in E_x} P(V_t \cap D(y))$ ,

$\Delta P_2 = \mathop{+}\limits_{D(y) \in E_x} P((V_t \cup \{x\}) \cap D(y))$ .

In (6) we first subtract the discrete polyhedron $\Delta P_1$ in $E_x$ from $P(V_t)$ and
then add the replacement $\Delta P_2$ reconstructed in $E_x$ after adding x to $V_t$.

Similarly, an operation for shrinking $P(V_t)$ to $P(V_t \setminus \{x\})$ by removing x
from $V_t$ is described by

$P(V_t \setminus \{x\}) = P(V_t) - \Delta P_1 + \Delta P_3$ (7)

where

$\Delta P_3 = \mathop{+}\limits_{D(y) \in E_x} P((V_t \setminus \{x\}) \cap D(y))$ .

We execute each deformation of (6) and (7) in two stages: subtraction
and then addition. We do not calculate the differences $-\Delta P_1 + \Delta P_2$
and $-\Delta P_1 + \Delta P_3$ first because there is no guarantee that the differences can
be calculated. To guarantee that the calculation result
becomes a discrete polyhedron, $-\Delta P_1$ and $\Delta P_2$ (resp. $\Delta P_3$) need to satisfy
conditions similar to those in Definition 2, and it is easy to find
counterexamples that do not satisfy the conditions. To maintain the uniqueness property
of $P(V_t \cup \{x\})$ and $P(V_t \setminus \{x\})$, we need the two-stage procedures for (6) and (7).

Given a sequence of volume data $V_1, V_2, \ldots$, we set $\Delta V_t$ to be the difference
between $V_t$ and $V_{t+1}$. If either (6) or (7) is applied for every point in $\Delta V_t$
in order, we obtain $P(V_{t+1})$ from $P(V_t)$. Note that $P(V_{t+1})$ is
independent of the order of points chosen from $\Delta V_t$ because of the uniqueness
property in Theorem 1. Thus, given a sequence $V_1, V_2, \ldots$, we obtain a
unique sequence $P(V_1), P(V_2), \ldots$ incrementally by (6) and (7), and each
$P(V_t)$ is equivalent to the discrete polyhedron directly constructed from $V_t$ by (3).
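
The locality claim behind (6) and (7) can be checked with a small Python sketch: adding one point x changes the 1-point content of exactly the eight cubes in E_x and of no other cube. The sketch again assumes that D(y) is the unit cube with vertices y + {0,1}^3, and the names `E`, `cube_contents` and `changed_cubes` are hypothetical helpers.

```python
from itertools import product

OFFSETS = list(product((0, 1), repeat=3))


def E(x):
    """E_x: base points y of the eight unit cubes D(y) that contain x."""
    return {(x[0] - a, x[1] - b, x[2] - c) for a, b, c in OFFSETS}


def cube_contents(V, y):
    """The 1-points of V inside D(y), assuming D(y) has the vertices y + {0,1}^3."""
    cube = frozenset((y[0] + a, y[1] + b, y[2] + c) for a, b, c in OFFSETS)
    return cube & V


def changed_cubes(V, x, candidates):
    """Cubes among `candidates` whose 1-points differ once x is added to V."""
    V_new = V | {x}
    return {y for y in candidates if cube_contents(V, y) != cube_contents(V_new, y)}


V = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
x = (1, 1, 1)
window = set(product(range(-2, 4), repeat=3))  # a search window larger than E_x
print(changed_cubes(V, x, window) == E(x))
```

Only the unit polyhedra of these eight cubes need to be subtracted and rebuilt, which is what makes the per-point update local.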
Fig. 2. The calculation procedure for topological features of a polyhedral object.



4 Topological Feature Extraction during Deformation 

In this paper, we consider topological features such as the number of objects in
volume data $V_t$ and the numbers of cavities and tunnels in objects. To calculate
these numbers from our polyhedral representation $P(V_t)$, we make use of the
number of connected components $b_c$ and the Euler characteristics $E_s$ and $E_v$ of
$P(V_t)$. The calculation procedure is derived from their relations and
illustrated in Figure 2. In this section, assuming that we already have a sequence
$P(V_1), P(V_2), \ldots$ obtained from a sequence $V_1, V_2, \ldots$ as shown
in the previous section and also the values of the topological features for $P(V_t)$,
we calculate the values of the topological features for $P(V_{t+1})$. Consequently, in this
section we derive the difference equations for calculating the values of the
topological features for $P(V_{t+1})$.

4.1 Connected Component Number 

The number of connected components is defined for any set in $\mathcal{P}$, the family of
sets of oriented unit discrete polygons, as follows.

Definition 3. For any pair of S and T in $A \in \mathcal{P}$, if there exists a path such
that

$S_1 = S, S_2, \ldots, S_a = T$

where every $S_i \in A$ and $S_i \cap S_{i+1} \ne \emptyset$, then A is connected.

The definition is different from that in [?]. While the connectivity is considered
between points in [?], we consider that between unit discrete polygons, and it
is based on the definition of connectivity in combinatorial topology [?]. To
decompose $P(V_1)$ into connected components, we apply an algorithm in [?] and
set $b_c(P(V_1))$ to be the initial number of connected components in $P(V_1)$ for
the sequence $V_t$.
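
One possible way to decompose a polygon set according to Definition 3 is a union-find pass over the polygons, joining two polygons whenever they share a vertex. The sketch below is not the algorithm cited above; it is a generic illustration that reads the condition "S_i and S_{i+1} intersect" as "the two polygons have at least one vertex in common", and `connected_components` is a hypothetical name.

```python
def connected_components(polygons):
    """Group polygons (tuples of vertices) linked by chains of shared vertices."""
    parent = list(range(len(polygons)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    first_owner = {}  # vertex -> index of the first polygon that used it
    for i, poly in enumerate(polygons):
        for v in poly:
            if v in first_owner:
                parent[find(i)] = find(first_owner[v])
            else:
                first_owner[v] = i

    groups = {}
    for i, poly in enumerate(polygons):
        groups.setdefault(find(i), []).append(poly)
    return list(groups.values())


surface = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((0, 1, 0), (1, 0, 0), (1, 1, 0)),
    ((5, 5, 5), (6, 5, 5), (5, 6, 5)),
]
b_c = len(connected_components(surface))
print(b_c)
```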

Let us assume that $b_c(P(V_t))$ is given and we calculate $b_c(P(V_{t+1}))$ from
$b_c(P(V_t))$. During the deformation from $P(V_t)$ to $P(V_{t+1})$, there may be
connectivity changes in a small local area around the points in $V_{t+1} \setminus V_t$, such as
joining and separating as shown in Figure 3. Similar local changes are
considered in [?], where they are called melting and constriction in the topological
sense. For joining, we have two possible cases for $b_c(P(V_{t+1}))$ such that

$b_c(P(V_{t+1})) = b_c(P(V_t)) - 1$ , (8)

$b_c(P(V_{t+1})) = b_c(P(V_t))$ . (9)

Equation (8) describes that two different components in $P(V_t)$ are joined into one as
shown in Figure 3 (a), and (9) describes that two different parts in one component are
joined as shown in Figure 3 (b). To distinguish between the cases (8) and (9),
we need to know whether the two joining parts are in the same component or not; such
information is automatically obtained if we have information on the decomposition
of $P(V_t)$ by connectivity.

Fig. 3. The changes of global connectivity caused by connectivity changes in a small
local area, such as joining ((a), (b)) and separating ((c), (d)).

For separating, we also have two possible cases such that

$b_c(P(V_{t+1})) = b_c(P(V_t)) + 1$ , (10)

$b_c(P(V_{t+1})) = b_c(P(V_t))$ . (11)

Equation (10) describes that one component is separated into two as shown
in Figure 3 (c). In (11), even though one component is separated locally, the
component is still joined at some other part as shown in Figure 3 (d). In any
case of local separation, therefore, we need to distinguish between (10) and (11)
by a global operation, which is similar to that for decomposition by connectivity
[?], to check whether the two separated parts are still in the same component or not.

If there is no local connectivity change before and after the deformation
from $P(V_t)$ to $P(V_{t+1})$, we keep the connected component number as in (9)
and (11). The local connectivity change is checked by using local connectivity
conditions which are similar to the simple point conditions of thinning algorithms [?]
rather than by using geometric conditions [?]. We are currently studying the
relations between our conditions for the local connectivity change and the simple
point conditions. In this paper, however, we do not discuss this further
and leave the topic for future work. Note that the local connectivity change
is checked only in the union of the sets $E_x$ for all x in $V_{t+1} \setminus V_t$, and the algorithm
complexity will be O(k) where k is the number of points in $V_{t+1} \setminus V_t$.

4.2 Object and Cavity Numbers 

By construction, $P(V_{t+1})$ is given as a set of unit discrete polygons and it represents
object surfaces. Because $P(V_{t+1})$ does not contain the inside structure of objects,
connected components in $P(V_{t+1})$ are classified into two types of object
surfaces: surfaces which face the exterior space and surfaces which face the cavity
space. These surfaces are called exterior and interior surfaces, respectively.
We see that the numbers of exterior and interior surfaces correspond to the numbers of
objects and cavities, respectively. Let $b_0(P(V_{t+1}))$ and $b_2(P(V_{t+1}))$ be the numbers
of objects and cavities in $V_{t+1}$. Then, we have the following relation

$b_c(P(V_{t+1})) = b_0(P(V_{t+1})) + b_2(P(V_{t+1}))$ .

For each component C in $P(V_{t+1})$, we distinguish between exterior and
interior surfaces by Algorithm 1. After classifying all C in $P(V_{t+1})$ into the two
sets of interior and exterior surfaces, we obtain $b_0(P(V_{t+1}))$ and $b_2(P(V_{t+1}))$.
Note that $b_0(P(V_{t+1}))$ and $b_2(P(V_{t+1}))$ are recomputed only for those $P(V_{t+1})$ whose
connectivity structure changed in the previous subsection.

Algorithm 1

input: C.

output: 0 (interior) or 1 (exterior).

begin

1. Setting x which is a point far away from all points in $P(V_{t+1})$ and
y in C, draw a line l from x to y;

2. choose the point z in C lying on l which is nearest to x;

3. find a unit discrete polygon S which includes z and set the normal vector
n of S which is oriented to the exterior of $P(V_{t+1})$;

4. if (y − x) · n > 0, then return 0;

5. else if (y − x) · n < 0, then return 1;

6. else if (y − x) · n = 0, then go back to step 1 and choose another x.

end
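
Steps 4 to 6 of Algorithm 1 reduce to a sign test on a dot product. A minimal Python sketch follows, assuming that the first-hit polygon and its outward normal n (steps 1 to 3) have already been found by some ray-casting routine that is not shown; `classify_component` is a hypothetical name, and returning `None` stands for "choose another x".

```python
def classify_component(x, y, n):
    """Return 0 (interior surface), 1 (exterior surface), or None if the ray is tangent."""
    d = sum((yi - xi) * ni for xi, yi, ni in zip(x, y, n))
    if d > 0:
        return 0   # the ray leaves the object material here: a cavity wall
    if d < 0:
        return 1   # the ray enters the object here: an exterior surface
    return None    # step 6: degenerate direction, retry with another x


# Ray from a far point on the +x side towards the face x = 1 of a unit cube.
print(classify_component((10.0, 0.5, 0.5), (1.0, 0.5, 0.5), (1.0, 0.0, 0.0)))
```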



4.3 Euler Characteristics 

There are two Euler characteristics for polyhedral objects in $\mathbb{R}^3$: $E_s$, which is
based on triangulation of polyhedral surfaces, and $E_v$, which is based on
tetrahedrization of polyhedral objects [?]. Since $P(V_t)$ is a surface representation,
only $E_s$ is calculated from $P(V_t)$ for any t such that

$E_s(P(V_t)) = s_0(P(V_t)) - s_1(P(V_t)) + s_2(P(V_t))$ (12)

where $s_0(P(V_t))$ and $s_1(P(V_t))$ denote the numbers of 0- and 1-dimensional unit
discrete polyhedra in the union of discrete polygons of $P(V_t)$, and $s_2(P(V_t))$
denotes the number of all unit discrete polygons in $P(V_t)$. For the
calculation of the number $b_1(P(V_t))$ of tunnels in the next subsection, however,
we use the relation [?]

$E_v(P(V_t)) = b_0(P(V_t)) - b_1(P(V_t)) + b_2(P(V_t))$ . (13)
Fig. 4. Two kinds of connection in a discrete polyhedron for which the relation (14) is
not established.

Fig. 5. Parts of a sequence of discrete polyhedra constructed from a sequence of 109 sets
of volume data which is used for an experiment. The numbers in the figures correspond
to the time slots in the sequence.



Since we already have $b_0(P(V_t))$ and $b_2(P(V_t))$ from the previous subsection,
only $E_v(P(V_t))$ is required to obtain $b_1(P(V_t))$ from (13).

There is an important relation between $E_s(P(V_t))$ and $E_v(P(V_t))$ which is
introduced in [?]:

$E_s(P(V_t)) = 2 E_v(P(V_t))$ . (14)

However, this relation is not satisfied when $P(V_t)$ contains connections of unit
discrete polyhedra as illustrated in Figure 4. In this paper, therefore, we consider
only simple polyhedra, namely 2-manifolds, which do not have such connections
in $P(V_t)$, to make use of (14).

From (12) and (14), the difference equation for $E_v(P(V_t))$ is obtained by

$E_v(P(V_{t+1})) = E_v(P(V_t)) + \frac{1}{2}(\Delta s_0 - \Delta s_1 + \Delta s_2)$ (15)

where for n = 0, 1, 2

$\Delta s_n = s_n(P(V_{t+1})) - s_n(P(V_t))$ .



4.4 Tunnel Number 

From (13), we obtain

$b_1(P(V_{t+1})) = b_0(P(V_{t+1})) + b_2(P(V_{t+1})) - E_v(P(V_{t+1}))$

$\qquad = b_1(P(V_t)) + \Delta b_0 + \Delta b_2 - \Delta E_v$ (16)

where for i = 0, 2

$\Delta b_i = b_i(P(V_{t+1})) - b_i(P(V_t))$ ,

$\Delta E_v = E_v(P(V_{t+1})) - E_v(P(V_t))$ .
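
The relations (12) to (16) amount to a few lines of integer arithmetic. The Python sketch below evaluates them for two illustrative closed surfaces whose cell counts are assumptions chosen for the example and are not taken from the paper: a single cube surface meshed by unit squares (8 vertices, 12 edges, 6 faces) and a square ring of eight unit cubes (32 vertices, 64 edges, 32 faces). Function names are hypothetical.

```python
def euler_surface(s0, s1, s2):
    """Equation (12): E_s = s0 - s1 + s2."""
    return s0 - s1 + s2


def tunnels(b0, b2, s0, s1, s2):
    """b1 from (13) with E_v = E_s / 2 from (14); valid only for 2-manifold surfaces."""
    e_s = euler_surface(s0, s1, s2)
    if e_s % 2:
        raise ValueError("relation (14) needs an even E_s")
    return b0 + b2 - e_s // 2


def tunnels_incremental(b1_prev, d_b0, d_b2, d_s0, d_s1, d_s2):
    """Equations (15) and (16) combined: update b1 from the changes in the counts."""
    d_es = d_s0 - d_s1 + d_s2
    if d_es % 2:
        raise ValueError("relation (14) needs an even change of E_s")
    return b1_prev + d_b0 + d_b2 - d_es // 2


cube = (8, 12, 6)     # one object, no cavity, no tunnel
ring = (32, 64, 32)   # one object, no cavity, one tunnel
print(tunnels(1, 0, *cube), tunnels(1, 0, *ring))

deltas = tuple(r - c for r, c in zip(ring, cube))
print(tunnels_incremental(0, 0, 0, *deltas))
```

The incremental form only needs the changes in the counts, which are confined to the cubes touched by the deformation.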
4.5 Algorithm

Summarizing the section, we present an incremental algorithm for extracting the
topological features of $P(V_{t+1})$ given those of $P(V_t)$, following the procedure
illustrated in Figure 2.

Algorithm 2

input: $b_c(P(V_t))$, $b_i(P(V_t))$ for i = 0, 1, 2, $E_v(P(V_t))$, $P(V_t)$, $P(V_{t+1})$.
output: $b_c(P(V_{t+1}))$, $b_i(P(V_{t+1}))$ for i = 0, 1, 2, $E_v(P(V_{t+1}))$.

begin

1. Around the points in $V_{t+1} \setminus V_t$, check the local connectivity change (see
subsection 4.1 for more information);

2. if joining is seen at the local region, then

2.1 if joining is the type of Figure 3 (a), apply (8) for $b_c(P(V_{t+1}))$;
otherwise, apply (9);

2.2 obtain $b_0(P(V_{t+1}))$ and $b_2(P(V_{t+1}))$ by Algorithm 1;

3. else if separating is seen at the local region, then

3.1 if two separated parts are in the same component (global check; see
subsection 4.1), apply (11) for $b_c(P(V_{t+1}))$; otherwise, apply (10);

3.2 obtain $b_0(P(V_{t+1}))$ and $b_2(P(V_{t+1}))$ by Algorithm 1;

4. else if no local change is seen, $b_c(P(V_{t+1})) = b_c(P(V_t))$, $b_0(P(V_{t+1})) =
b_0(P(V_t))$ and $b_2(P(V_{t+1})) = b_2(P(V_t))$;

5. obtain $E_v(P(V_{t+1}))$ from (15);

6. obtain $b_1(P(V_{t+1}))$ from (16).

end



We see that only step 3.1 requires a global operation, namely checking the
whole $P(V_{t+1})$, and the worst-case complexity will be $O(l^2)$ where l is the number
of discrete polygons in $P(V_{t+1})$. With an efficient data structure for
discrete polyhedra that uses the spatial location information of discrete polygons
in a discrete polyhedron, it may be possible to reduce the complexity; further
discussion is left for future work. Since the complexity of every step in
Algorithm 2 except for step 3.1 will be O(k) where k is the number of elements
in $V_{t+1} \setminus V_t$, if no local separation occurs during the deformation, the algorithm
works very efficiently by using only local operations.
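
The control flow of Algorithm 2 can be sketched as a single update function. In this Python illustration the local connectivity check (step 1), the global check of step 3.1 and Algorithm 1 are assumed to be done elsewhere; their results enter as the hypothetical arguments `change`, `same_component` and `b0_b2`. The `Features` record and the function name `step` are likewise assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Features:
    bc: int  # connected components of the surface
    b0: int  # objects
    b1: int  # tunnels
    b2: int  # cavities
    ev: int  # Euler characteristic E_v


def step(prev, change, same_component, b0_b2, d_s):
    """One pass of Algorithm 2.

    change:         "join", "separate" or None (outcome of the local check, step 1)
    same_component: for "join", the joined parts already belonged to one component;
                    for "separate", the parts remain connected elsewhere (step 3.1)
    b0_b2:          (b0, b2) from Algorithm 1, used only when a change occurred
    d_s:            (delta s0, delta s1, delta s2)
    """
    if change == "join":
        bc = prev.bc if same_component else prev.bc - 1   # (9) or (8)
    elif change == "separate":
        bc = prev.bc if same_component else prev.bc + 1   # (11) or (10)
    else:
        bc = prev.bc                                      # step 4
    b0, b2 = b0_b2 if change else (prev.b0, prev.b2)
    ev = prev.ev + (d_s[0] - d_s[1] + d_s[2]) // 2        # (15)
    b1 = b0 + b2 - ev                                     # (13), equivalent to (16)
    return Features(bc, b0, b1, b2, ev)


# A solid block closes up into a ring: one component, and a tunnel appears.
before = Features(bc=1, b0=1, b1=0, b2=0, ev=1)
after = step(before, "join", True, (1, 0), (24, -52 + 104, 26))
print(after)
```

The deltas in the usage line are written so that their alternating sum is -2, matching the change from a cube-like surface (E_s = 2) to a ring-like surface (E_s = 0).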

5 Experimental Results 

We apply Algorithm 2 to extract the topological features from a sequence of
109 sets of volume data. Note that we also use (6) to obtain a sequence of
expanding discrete polyhedra from the sequence of volume data. The expansion of
discrete polyhedra in the sequence is shown in Figure 5. The calculation results
of the topological features are shown in Figure 6. In this experiment, we succeeded
in avoiding step 3.1 in Algorithm 2, namely global operations.

Fig. 6. The topological feature changes of the sequential volume data given in Figure 5.



6 Conclusions 

In this paper, we have presented a procedure for the extraction of topological
features from sequential volume data using a polyhedral object representation
and its set operations. In the steps of the procedure, we have obtained the
difference equations for calculating the values of topological features based on
the relations between the features. We have also shown experimental results
for sequential volume data in which the deformation of objects causes changes
in their topological features.

The authors thank the reviewers for their helpful comments to improve
the paper. The first author also thanks Prof. Gilles Bertrand and Prof. Michel
Couprie at ESIEE for the fruitful discussions on the topic. A part of this work
was supported by JSPS Grant-in-Aid for Encouragement of Young Scientists
(12780207).
Abstract. This paper addresses the problem of enhancement of noisy
scalar and vector fields, when they are known to be constrained to a
manifold. As an example, we address selective smoothing of orientation
using the geometric Beltrami framework. The orientation vector field is
represented accordingly as the embedding of a two dimensional surface in
the spatial-feature manifold. Orientation diffusion is treated as a canonical
example where the feature (orientation in this case) space is the unit
circle $S^1$. Applications to color analysis are discussed and numerical
experiments demonstrate again the power of this framework for non-trivial
geometries in image processing.

1 Introduction 

Feature enhancement is part of algorithms that play a major role in many image 
analysis and processing applications. These applications include texture process- 
ing in a transform space, disparity and depth estimation in stereo vision, robust 
computation of derivatives, optical flow and orientation vector fields in color 
processing. We are witnessing the emergence of a variety of methods for feature 
denoising that generalize the traditional image denoising. Here, we show how 
harmonic map methods, defined via the Beltrami framework, can be used to 
perform adaptive feature and shape denoising. 

We concentrate on the example of direction diffusion that gives us informa- 
tion on the preferred direction at every pixel. This example is useful in applica- 
tions like texture and color analysis and can be generalized to motion analysis. It 
incorporates all the problems and characteristics of more involved feature spaces. 

The input to the denoising process is a vector field on the image. The values
of this vector field are in the unit circle $S^1$. The given vector field is noisy and
we wish to denoise it under the assumption that, for the class of images we are
interested in, this vector field is piecewise smooth; see [2] for our analysis of
higher dimensional spheres $S^n$.

Two approaches for this problem are known. Perona was the first to address
this issue directly [?]; he uses a single parameter $\theta$ as an internal coordinate in
$S^1$. Next, Tang, Sapiro and Caselles [?] embedded the unit circle $S^1$ in $\mathbb{R}^2$
(the sphere $S^2$ in $\mathbb{R}^3$) and worked with the external coordinates; see also [?] for
a related effort. The first approach is problematic because of the periodicity of
$S^1$. Averaging small angles around zero such as $\theta = \epsilon$ and $\theta = 2\pi - \epsilon$ leads
to the erroneous conclusion that the average angle is $\theta = \pi$. Perona solved
this problem by exponentiating the angle such that $V = e^{i\theta}$. This is actually
the embedding of $S^1$ in $\mathbb{C}$, which is isometric to $\mathbb{R}^2$. This method is specific
to a two-dimensional embedding space where complex numbers can be used. The
problem in using only one internal coordinate manifests itself in the numerical
implementation of the PDE through the breaking of rotation invariance. In the
second approach we have to make sure that we always stay in $S^1$ along the flow.
This problem is known as the projection problem. It is solved in the continuum
by adding a projection term. Chan and Shen [?] also use external coordinates
with a projection term but suggest using a Total Variation (TV) measure [?]
in order to better preserve discontinuities in the vector field. This works well
for the case where the codimension is one, like a circle. Yet it is difficult to
generalize to higher codimensions like the sphere. Moreover, the flow of the
external coordinates is difficult to control numerically since errors should be
projected on $S^n$ and no well-defined projection exists.
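
The periodicity problem and the exponentiation remedy described above can be reproduced numerically. The following Python sketch compares a naive arithmetic mean of two angles near zero with the angle of the mean of the corresponding unit complex numbers; the function names are illustrative only, and the circular mean is undefined when the vectors cancel exactly.

```python
import cmath
import math


def naive_mean(angles):
    """Arithmetic mean of the raw angle values."""
    return sum(angles) / len(angles)


def circular_mean(angles):
    """Sum the unit vectors e^{i*theta} and read the angle back, in [0, 2*pi)."""
    z = sum(cmath.exp(1j * a) for a in angles)
    return cmath.phase(z) % (2 * math.pi)


eps = 0.1
angles = [eps, 2 * math.pi - eps]
print(naive_mean(angles))     # close to pi, the opposite direction
print(circular_mean(angles))  # close to 0 modulo 2*pi
```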

We propose a solution to these problems and introduce an adaptive smoothing
process that preserves orientation discontinuities. The proposed solutions
work for all dimensions and codimensions, and overcome possible parameterization
singularities by introducing several internal coordinates on different patches
(charts) such that the union of the patches is $S^n$. Adaptive smoothness is
achieved by the description of the vector field as a two-dimensional surface
embedded in a three- and four-dimensional spatial-feature manifold for the $S^1$ and
$S^2$ cases respectively. We treat here the $S^1$ case only due to space limitations.
The problem is formulated, in the Beltrami framework [?], in terms of the
embedding map

$Y : (\Sigma, g) \to (M, h)$

where $\Sigma$ is the two-dimensional image manifold, and M, in the following
examples, is $\mathbb{R}^n \times S^1$ with n = 2 (n = 4) for gray-level (color) images. The key
point is the choice of local coordinate systems for both manifolds. Note the
difference w.r.t. [?], where the image metric is flat. At the same time we
should verify that the geometric filter does not depend on the specific choice of
coordinates we make.

Once a local coordinate system is chosen for the embedding space and the
optimization is done directly in these coordinates, the updated quantities always
lie in M. Other examples of enhancement by the Beltrami framework of
non-flat feature spaces, like the color perceptual space and the derivatives vector
field, can be found in [?].



2 The Beltrami Framework 



Let us briefly review the Beltrami geometric framework for non-linear diffusion
in computer vision [?].
2.1 Representation and Riemannian Structure

We represent an image and other local features as an embedding map of a
Riemannian manifold in a higher dimensional space. The simplest example is the
image itself, which is represented as a 2D surface embedded in $\mathbb{R}^3$. We denote
the map by $Y : \Sigma \to \mathbb{R}^3$, where $\Sigma$ is a two-dimensional surface, and we
denote the local coordinates on it by $(x^1, x^2)$. The map Y is given in general
by $(Y^1(x^1,x^2), Y^2(x^1,x^2), Y^3(x^1,x^2))$. We choose on this surface a Riemannian
structure, namely, a metric. The metric is a positive definite and symmetric
2-tensor that may be defined through the local distance measurements

$ds^2 = g_{11}(dx^1)^2 + 2 g_{12}\, dx^1 dx^2 + g_{22}(dx^2)^2$ . (1)

We use below the Einstein summation convention in which the above equation
reads $ds^2 = g_{\mu\nu} dx^\mu dx^\nu$, where repeated indices are summed over. We denote the
inverse of the metric by $g^{\mu\nu}$.

2.2 Image Metric Selection: The Induced Metric 

A reasonable assumption is that distances we measure in the embedding
spatial-feature space, such as distances between pixels and differences between
grey-levels, correspond directly to distances measured on the image manifold. This is
the assumption of isometric embedding under which we can calculate the image
metric in terms of the embedding maps $Y^i$ and the embedding space metric $h_{ij}$.
It follows directly from the fact that the length of infinitesimal distances on the
manifold can be calculated in the manifold and in the embedding space with
the same result. Formally, $ds^2 = g_{\mu\nu} dx^\mu dx^\nu = h_{ij} dY^i dY^j$. By the chain rule,
$dY^i = \partial_\mu Y^i dx^\mu$, we get $ds^2 = g_{\mu\nu} dx^\mu dx^\nu = h_{ij} \partial_\mu Y^i \partial_\nu Y^j dx^\mu dx^\nu$, from which
we have

$g_{\mu\nu} = h_{ij} \partial_\mu Y^i \partial_\nu Y^j$ . (2)

Intuitively, we would like our filters to use the support of the image surface
rather than the image domain. The reason is that edges can be considered as
'high cliffs' in the image surface, and a Gaussian filter defined over the image
domain would smooth uniformly everywhere and would not be sensitive to the
edges. A Gaussian filter defined over the image manifold, in contrast, would smooth
along the walls of the edges and preserve these high cliffs of the image surface.

As an example let us take the gray-level image as a two-dimensional image
manifold embedded in the three dimensional Euclidean space $\mathbb{R}^3$. The
embedding maps are

$(Y^1(x^1,x^2) = x^1, \; Y^2(x^1,x^2) = x^2, \; Y^3(x^1,x^2) = I(x^1,x^2))$ . (3)

We choose to parameterize the image manifold by the canonical coordinate
system $x^1 = x$ and $x^2 = y$. The embedding, by abuse of notation, is (x, y, I(x, y)).
The induced metric element $g_{11}$ is calculated as follows

$g_{11} = h_{ij} \partial_{x^1} Y^i \partial_{x^1} Y^j = \delta_{ij} \partial_x Y^i \partial_x Y^j = \partial_x x\, \partial_x x + \partial_x y\, \partial_x y + \partial_x I\, \partial_x I = 1 + I_x^2$ . (4)

Other elements are calculated in the same manner.
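
As a small numerical companion to (2) and (4), the following Python sketch evaluates all three elements of the induced metric of the graph (x, y, I(x, y)) from the image derivatives. The central-difference gradient and the function names are assumptions made for this example; any consistent derivative approximation would serve.

```python
def induced_metric(Ix, Iy):
    """Induced metric of the surface (x, y, I(x, y)) in Euclidean R^3, following (2)."""
    g11 = 1.0 + Ix * Ix
    g12 = Ix * Iy
    g22 = 1.0 + Iy * Iy
    det = g11 * g22 - g12 * g12  # algebraically equal to 1 + Ix^2 + Iy^2
    return (g11, g12, g22), det


def central_gradient(I, r, c):
    """Central differences on a list-of-rows image; x runs along columns, y along rows."""
    Ix = (I[r][c + 1] - I[r][c - 1]) / 2.0
    Iy = (I[r + 1][c] - I[r - 1][c]) / 2.0
    return Ix, Iy


# A linear ramp I = 2x + 3y has the constant gradient (2, 3).
I = [[2 * x + 3 * y for x in range(5)] for y in range(5)]
Ix, Iy = central_gradient(I, 2, 2)
(g11, g12, g22), det = induced_metric(Ix, Iy)
print(g11, g12, g22, det)
```

The determinant grows with the squared gradient magnitude, which is why a diffusion weighted by this metric slows down across edges.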
2.3 A Measure on the Space of Embedding Maps

Denote by $(\Sigma, g)$ the image manifold and its metric, and by (M, h) the
space-feature manifold and its metric. Then the functional S[Y] attaches a real
number to a map $Y : \Sigma \to M$,

$S[Y^i, g_{\mu\nu}, h_{ij}] = \int dV \, \langle \nabla Y^i, \nabla Y^j \rangle_g \, h_{ij}$ , (5)

where $dV = dx^1 dx^2 \cdots dx^n \sqrt{g}$ is a volume element and the scalar product $\langle \cdot , \cdot \rangle_g$
is defined with respect to the image metric, i.e. $\langle \nabla Y^i, \nabla Y^j \rangle_g = g^{\mu\nu} \partial_\mu Y^i \partial_\nu Y^j$.
This functional is known in high-energy physics as the Polyakov action [?]. Note
that the image metric and the feature coordinates, i.e. intensity, color, orientation
etc., are independent variables. The minimization of the functional with respect
to the image metric can be solved analytically in the two-dimensional case (see
for example [?]). The minimizer is the induced metric. If we choose, a priori, the
image metric induced from the metric of the embedding spatial-feature space M,
then the Polyakov action is reduced to an area (volume) of the image manifold.

Using standard methods in the calculus of variations (see [?]), the Euler-Lagrange
equations with respect to the embedding are

$-\frac{1}{2\sqrt{g}} h^{il} \frac{\delta S}{\delta Y^l} = \frac{1}{\sqrt{g}} \partial_\mu \left( \sqrt{g}\, g^{\mu\nu} \partial_\nu Y^i \right) + \Gamma^i_{jk} \langle \nabla Y^j, \nabla Y^k \rangle_g$ . (6)

Since $(g_{\mu\nu})$ is positive definite, $g = \det(g_{\mu\nu}) > 0$ for all $x^\mu$. This factor is the
simplest one that does not change the minimization solution while giving a
geometric (reparameterization invariant) expression. The operator that is acting on
$Y^i$ in the first term is the natural generalization of the Laplacian from flat spaces
to manifolds and is called the second order differential parameter of Beltrami [?],
or for short the Beltrami operator, and is denoted by $\Delta_g$. The second term involves
the Levi-Civita connection whose coefficients are given in terms of the metric of
the embedding space

$\Gamma^i_{jk} = \frac{1}{2} h^{il} \left( \partial_j h_{lk} + \partial_k h_{jl} - \partial_l h_{jk} \right)$ . (7)

This is the term that guarantees that the image surface flows in a non-Euclidean
manifold and not in $\mathbb{R}^n$.

A map that satisfies the Euler-Lagrange equations $-\frac{1}{2\sqrt{g}} h^{il} \frac{\delta S}{\delta Y^l} = 0$ is a
harmonic map. The one- and two-dimensional examples are a geodesic curve
on a manifold and a minimal surface.

The non-linear diffusion or scale-space equation emerges as the gradient
descent minimization flow

$Y^i_t \equiv \frac{\partial Y^i}{\partial t} = -\frac{1}{2\sqrt{g}} h^{il} \frac{\delta S}{\delta Y^l}$ . (8)

This flow evolves a given surface towards a minimal surface, and in general it
continuously changes a map towards a harmonic map.
There are a few major differences between this flow and those suggested in
[?]. Notably, the metric that is used in those papers is flat while we use the
induced metric that combines the information about the geometry of the signal
and that of the feature manifold.

3 Beltrami Direction Diffusion 

We are interested in attaching a unit vector field to an image. More precisely 
we would like to construct a non-linear diffusion process that will enhance a 
given noisy vector field of this form while preserving the unit magnitude of each 
vector. 



3.1 The Embedding Space Geometry 

Hemispheric coordinate system. Denote the vector field by two components
$(U, V)^T$ such that $U^2 + V^2 = 1$. This description is actually an extrinsic
description. The unit circle $S^1$ is a one-dimensional curve and one parameter
should suffice as an internal description. Since $S^1$ is a connected and
compact manifold without boundary, we need at least two coordinate systems
to cover all the points of $S^1$ such that the transition function between the two
patches is infinitely differentiable at all points that belong to the intersection of
the two patches. We define the two patches as follows: The coordinate system
on $S^1 - \{(\pm 1, 0)\}$ is U, with the induced metric

$ds^2_{S^1} = dU^2 + dV^2 = dU^2 + \left( d\sqrt{1-U^2} \right)^2 = \frac{1}{1-U^2}\, dU^2$ . (9)

The coordinate system on $S^1 - \{(0, \pm 1)\}$ is V, with the induced metric

$ds^2_{S^1} = dU^2 + dV^2 = \left( d\sqrt{1-V^2} \right)^2 + dV^2 = \frac{1}{1-V^2}\, dV^2$ . (10)

It is clear that the transformations $V(U) = \sqrt{1-U^2}$ and U(V) are
differentiable anywhere on the intersection $S^1 - \{(\pm 1, 0), (0, \pm 1)\}$.

The embedding is of the two-dimensional image manifold in the
three-dimensional space $\mathbb{R}^2 \times S^1$. The canonical embedding for the first patch is
$(Y^1(x,y) = x, \; Y^2(x,y) = y, \; Y^3(x,y) = U(x,y))$, and for the second patch is

$(Y^1(x,y) = x, \; Y^2(x,y) = y, \; Y^3(x,y) = V(x,y))$ . (11)
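
For readers who want the intermediate step behind the induced metric on the U patch, the differential of the constraint can be written out as follows. This is a routine calculation added for illustration; it uses only the relation $V = \sqrt{1-U^2}$ stated above, and the V patch follows by exchanging the roles of U and V.

```latex
\begin{align*}
V &= \sqrt{1-U^2}
  \quad\Longrightarrow\quad
  dV = -\frac{U}{\sqrt{1-U^2}}\,dU ,\\
ds^2_{S^1} &= dU^2 + dV^2
  = \Bigl(1 + \frac{U^2}{1-U^2}\Bigr)\,dU^2
  = \frac{1}{1-U^2}\,dU^2 .
\end{align*}
```

The factor $1/(1-U^2)$ diverges as $U \to \pm 1$, which is why this chart is used only away from the points $(\pm 1, 0)$.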



Stereographic coordinate system. Another possibility is to use the
stereographic coordinate system. The stereographic transformation gives the values
of $Y^i$ as functions of the points on the north (south) hemispheres of the
hypersphere. Explicitly, for $S^1$ it is given (after shifting the indices by two for a
coherent notation with the next sections) as $Y^3 = \frac{U^3}{1 \mp U^4}$. Inverting these relations
we find

$U^3 = \frac{2Y^3}{1 + (Y^3)^2}$ , $\quad U^4 = \pm \frac{(Y^3)^2 - 1}{1 + (Y^3)^2}$ , (12)



Using Beltrami Framework for Orientation Diffusion in Image Processing 351 



and the induced metric is

$h_{33} = \frac{4}{\left(1 + (Y^3)^2\right)^2}$ .

Due to space limitations we defer further analysis on the stereographic coordinate
system. Below we analyze the hemispheric coordinate system.



3.2 The Beltrami Operator 



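The line element on the image manifold is

$ds^2 = ds^2_{\mathbb{R}^2} + ds^2_{S^1} = dx^2 + dy^2 + \frac{1}{1-U^2}\, dU^2$ (13)

and by the chain rule

$ds^2 = (1 + A(U) U_x^2)\, dx^2 + 2 A(U) U_x U_y \, dx\, dy + (1 + A(U) U_y^2)\, dy^2$ , (14)

where $A(U) = \frac{1}{1-U^2}$, and similarly for V.

The induced metric is therefore

$(g_{\mu\nu}) = \begin{pmatrix} 1 + A(U) U_x^2 & A(U) U_x U_y \\ A(U) U_x U_y & 1 + A(U) U_y^2 \end{pmatrix}$ , (15)

and the Beltrami operator acting on U is $\Delta_g U = \frac{1}{\sqrt{g}} \partial_\mu \left( \sqrt{g}\, g^{\mu\nu} \partial_\nu U \right)$, where
$g = 1 + A(U)(U_x^2 + U_y^2)$ is the determinant of $(g_{\mu\nu})$, and $(g^{\mu\nu})$ is the inverse
matrix of $(g_{\mu\nu})$.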



3.3 The Levi-Civita Connection 

Since the embedding space is non-Euclidean we have to calculate the Levi-Civita
connection. Recall that the metric of the embedding space is

$$(h_{ij}) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & A(U) \end{pmatrix}. \qquad (16)$$

The Levi-Civita connection coefficients are given by the fundamental theorem of
Riemannian geometry in the following formula:
$\Gamma^i_{jk} = \frac{1}{2}h^{il}\left(\partial_j h_{lk} + \partial_k h_{jl} - \partial_l h_{jk}\right)$,
where the derivatives are taken with respect to $Y^i$ for $i = 1, 2, 3$.

The only non-vanishing term is $\Gamma^3_{33}$, which reads $\Gamma^3_{33} = U h_{33}$.

The second term in the EL equations in this case reads $U h_{33}\|\nabla U\|^2$. We can
rewrite this expression using

$$h_{33}\|\nabla U\|^2 = 2 - \frac{1}{g}(1 + g), \qquad (17)$$

where we used the expression for the induced metric, Eq. (15).
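The two facts used in this subsection can be checked directly from the definitions. The short derivation below is a reconstruction; it assumes that $\|\nabla U\|^2$ denotes the norm with respect to the induced metric, i.e. $g^{\mu\nu}\partial_\mu U\,\partial_\nu U$.

```latex
\begin{align*}
\Gamma^3_{33} &= \tfrac{1}{2}\,h^{33}\,\partial_U h_{33}
  = \tfrac{1}{2}\,(1 - U^2)\,\frac{2U}{(1 - U^2)^2}
  = \frac{U}{1 - U^2} = U h_{33},\\
\|\nabla U\|^2 &= g^{\mu\nu}\,\partial_\mu U\,\partial_\nu U
  = \frac{U_x^2 + U_y^2}{g},\\
h_{33}\,\|\nabla U\|^2 &= \frac{A(U)\,(U_x^2 + U_y^2)}{g}
  = \frac{g - 1}{g} = 2 - \frac{1}{g}(1 + g).
\end{align*}
```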
3.4 The Flow and the Switches

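The Beltrami flow is

$$Y^i_t = \Delta_g Y^i + \Gamma^i_{jk}\,\partial_\mu Y^j\,\partial_\nu Y^k\,g^{\mu\nu} \qquad (18)$$

for $i = 3$, and similarly for the other coordinate charts. Gathering together all
the pieces we finally get the Beltrami flow

$$U_t = \Delta_g U + \frac{g - 1}{g}\,U, \qquad V_t = \Delta_g V + \frac{g - 1}{g}\,V. \qquad (19)$$

In the implementation we compute the diffusion for $U$ and $V$ simultaneously
and take the values $(U, \mathrm{sign}(V)\sqrt{1 - U^2})$ for the range $U^2 \le V^2$ and the values
$(\mathrm{sign}(U)\sqrt{1 - V^2}, V)$ for the range $V^2 < U^2$.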
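The switch between charts can be sketched per pixel as follows. This is a minimal illustration that assumes the rule is to trust the coordinate with the smaller magnitude and to rebuild the other from the constraint U^2 + V^2 = 1; the function name and the clamping with `max(0.0, ...)` are assumptions added for numerical safety.

```python
import math


def switch_charts(U, V):
    """Return a point on the unit circle from the better-conditioned chart.

    The chart whose coordinate is farther from +-1 is kept; the other
    coordinate is rebuilt from the constraint and keeps its previous sign.
    """
    if U * U <= V * V:
        # U-chart: A(U) = 1 / (1 - U^2) is well behaved here.
        return U, math.copysign(math.sqrt(max(0.0, 1.0 - U * U)), V)
    # V-chart otherwise.
    return math.copysign(math.sqrt(max(0.0, 1.0 - V * V)), U), V


# After a diffusion step the pair may drift off the circle slightly.
u1, v1 = switch_charts(0.1, 0.98)
u2, v2 = switch_charts(-0.9, 0.5)
```

Applying this after each time step keeps every pixel on the circle by construction, whichever chart was used for the update.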



4 Color Diffusion 



There are many coordinate systems and models of color space which try to be 
closer to human color perception. One of the popular coordinate systems is the 
HSV system jOj. In this system, color is characterized by the Hue, Saturation 
and Value. The Saturation and Value take their values in $\mathbb{R}^+$, while the Hue is
an angle that parameterizes $S^1$.

In order to denoise and enhance color images by a non-linear diffusion process
which is adapted to human perception we use here the HSV system. The Hue
coordinate needs special treatment along the lines of Section 3.

Let us represent the image as a mapping $Y : \Sigma \to \mathbb{R}^4 \times S^1$, where $\Sigma$ is the
two-dimensional image surface and $\mathbb{R}^4 \times S^1$ is parameterized by the coordinates
$(x, y, H, S, V)$. As mentioned above, a diffusion process in this coordinate system
is problematic. We define therefore two coordinates



$$U = \cos H; \qquad W = \sin H$$

and continue in a similar way to Section 3. The metric of $\mathbb{R}^4 \times S^1$ on the patch
where $U$ parameterizes $S^1$ and $W(U)$ is non-singular is

$$(h_{ij}) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & A(U) & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad (20)$$

where $A(U) = 1/(1 - U^2)$.

The induced metric is therefore

$$\begin{aligned}
ds^2 &= dx^2 + dy^2 + A(U)\,dU^2 + dS^2 + dV^2 \\
     &= dx^2 + dy^2 + A(U)(U_x dx + U_y dy)^2 + (S_x dx + S_y dy)^2 + (V_x dx + V_y dy)^2 \\
     &= (1 + A(U)U_x^2 + S_x^2 + V_x^2)\,dx^2 + 2(A(U)U_xU_y + S_xS_y + V_xV_y)\,dx\,dy \\
     &\quad + (1 + A(U)U_y^2 + S_y^2 + V_y^2)\,dy^2. 
\end{aligned} \qquad (21)$$

Similar expressions are obtained on the other dual patch. 

The only non-vanishing Levi-Civita connection coefficient is $\Gamma^3_{33} = U h_{33}$.
The resulting flow is

$$\begin{aligned}
U_t &= \Delta_g U + 2U - U(g^{11} + g^{22}) \\
W_t &= \Delta_g W + 2W - W(g^{11} + g^{22}) \\
S_t &= \Delta_g S \\
V_t &= \Delta_g V.
\end{aligned} \qquad (22)$$

Note that the switch between U and W should be applied not only to the U and
W equations but also to the S and V evolution equations where, at each point,
one needs to work with the metric that is defined on one of the patches.
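To see why the hue angle itself is an awkward quantity to smooth, the following sketch compares a naive average of two hue angles with an average taken through the coordinates U = cos H and W = sin H. The averaging is only an illustration of the periodicity problem, not the Beltrami flow of Eq. (22); all function names are hypothetical.

```python
import math


def hue_to_uw(h):
    """Map a hue angle in radians to the pair (U, W) = (cos H, sin H)."""
    return math.cos(h), math.sin(h)


def uw_to_hue(u, w):
    """Recover a hue angle in [0, 2*pi] from a (possibly unnormalized) pair."""
    return math.atan2(w, u) % (2.0 * math.pi)


def naive_mean(h1, h2):
    """Average the angles as if hue were an ordinary real number."""
    return 0.5 * (h1 + h2)


def circular_mean(h1, h2):
    """Average through (U, W), which respects the periodicity of the hue."""
    u1, w1 = hue_to_uw(h1)
    u2, w2 = hue_to_uw(h2)
    return uw_to_hue(u1 + u2, w1 + w2)


# Two nearly identical hues on either side of the 0 / 2*pi cut.
h1, h2 = 0.1, 2.0 * math.pi - 0.1
m_naive = naive_mean(h1, h2)      # lands near pi, the opposite hue
m_circ = circular_mean(h1, h2)    # lands near 0 (equivalently 2*pi)
```

The naive average jumps to the opposite side of the hue circle, which is the kind of artifact the (U, W) charts are meant to avoid.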

5 Experimental Results 

Our first example deals with the gradient direction flow via the Beltrami
framework. Figure 1 shows a vector field before and after the application of the
flow for a given evolution time. The normalized gradient vector field extracted
from an image is presented before and after the flow, showing how the field flows
into a new field with smooth orientation transitions.

Next, we explore a popular model that captures some of our color perception. 
The HSV (hue, saturation, value) model proposed in 0 is often used as a ‘user 
oriented’ color model, rather than the RGB ‘machine-oriented’ model. 

Figure 2 shows the classical representation of the HSV color space, in which
the hue is measured as an angle, while the value (sometimes referred to as
brightness) and the color saturation are mapped onto finite non-periodic
intervals. This model lends itself to a filter that operates on the spatial x, y
coordinates, the value and saturation coordinates, and the hue periodic variable.
Our image is now embedded in $\mathbb{R}^4 \times S^1$.
















Fig. 1. Two vector fields before (left) and after (right) the flow on $S^1$.

Fig. 2. The HSV color model captures human color perception better than the RGB
model which is the common way our machines represent colors. The original image
(left), the noisy image (middle) and the filtered image (right) demonstrate the effect of
the flow as a denoising filter in the HSV color space. For further examples follow the
links at http://www.cs.technion.ac.il/~ron/pub.html



6 Concluding Remarks 



There are two important issues in the process of denoising a constrained feature 
field. The first is to make the process compatible with the constraint in such a 
way that the latter is never violated along the flow. The second is the type of 
regularization which is applied in order to preserve significant discontinuities of 
the feature field while removing noise.

These two issues are treated in this paper via the Beltrami framework. First 
a Riemannian structure, i.e. a metric, is introduced on the feature manifold 
and several local coordinate systems are chosen to describe intrinsically the 
constrained feature manifold. The diffusion process acts on these coordinates and 
the compatibility with the constraint is achieved through the intrinsic nature 
of the coordinate system. The difficulty in working on a non-Euclidean space 
transforms itself to the need to locally choose the best coordinate system to 
work with. 

Preservation of significant discontinuities is dealt with by using the induced
metric and the corresponding Laplace-Beltrami operator acting on feature coor- 
dinates only. This operation is in fact a projection of the mean curvature normal 
vector on the feature direction(s). This projection slows down the diffusion pro- 
cess along significant (supported) discontinuities, i.e. edges. 

The result of this algorithm is an adaptive smoothing process for a con- 
strained feature space in every dimension and codimension. As examples we 
showed how our geometric model coupled with a proper choice of charts han- 
dles the orientation diffusion problem. This is a new application of the Beltrami 
framework proposed in mi- We tested our model on vector fields restricted to 
the unit circle $S^1$, and hybrid spaces like the HSV color space. The integration
of the spatial coordinates with the color coordinates yields a selective smoothing
filter for images in which some of the coordinates are restricted to a circle.
Abstract. Techniques to estimate the surface area of regular solids
based on polyhedrization are classified as either local or global. Surface
area calculated by local techniques generally fails to be multigrid
convergent. One of the global techniques, which is based on calculating the
convex hull, shows a tendency to be multigrid convergent. However, this
algorithm only deals with convex sets. The paper estimates the surface
area using another global technique called the DPS (Digital Planar Segment)
algorithm. The projection of these DPSes into Euclidean planes is used
to estimate the surface area. Multigrid convergence experiments of the
estimated surface area value are used to evaluate the performance of this
new method for surface area measurement.



1 Introduction 

Gridding techniques are widely used to represent volume data sets in three- 
dimensional space in the field of computer-based image analysis. The problem 
of multigrid convergent surface area measurement has been discussed for more
than one hundred years |8pi tij and has not yet reached a satisfactory result. In [2],
regular solids are defined as the models of ‘3D objects’ and as sets in $\mathbb{R}^3$ being
candidates for multigrid surface area studies. The surface areas of such sets are
well defined.

C. Jordan ^ studied the problem of volume estimation based on gridding 
techniques in 1892, and C. F. Gauss (1777-1855) studied the area problem for 
planar regions also based on this technique. From today’s point of view, gridding
is also considered a digitization, which maps a ‘real object’ into a grid point set
with a certain resolution r, defined as the number of grid points per unit.
Jordan digitization is characterized by inclusion digitization, being the union of 
all cubes completely contained in the topological interior of the given regular 
solid, and intersection digitization, being the union of all cubes having a non- 
empty intersection with the given set. Gauss center point digitization is the 
union of all cubes having a center point in the given set. The Gauss center 
point digitization scheme is used in this paper. It maps given regular solids into 
orthogonal grid sets, in which each grid edge (i.e. an edge of a grid square or 
grid cube) is of uniform length (grid constant). 
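A minimal sketch of Gauss center point digitization for an ellipsoid in standard position is given below. It assumes that cube centers sit at integer grid points and that the solid is enlarged r times while the grid constant stays 1; the function name and the sample parameters are illustrative only.

```python
import math


def gauss_digitize_ellipsoid(a, b, c, r):
    """Gauss center point digitization of an ellipsoid with semi-axes a, b, c.

    The solid is enlarged r times and the grid constant is 1. A unit cube,
    identified by its integer center, is kept when the center lies in the
    enlarged ellipsoid.
    """
    A, B, C = a * r, b * r, c * r
    cubes = set()
    for i in range(-math.ceil(A), math.ceil(A) + 1):
        for j in range(-math.ceil(B), math.ceil(B) + 1):
            for k in range(-math.ceil(C), math.ceil(C) + 1):
                if (i / A) ** 2 + (j / B) ** 2 + (k / C) ** 2 <= 1.0:
                    cubes.add((i, j, k))
    return cubes


# Unit ball at resolution r = 10: the cube count approximates the volume.
cubes = gauss_digitize_ellipsoid(1.0, 1.0, 1.0, 10)
volume_ratio = len(cubes) / (4.0 / 3.0 * math.pi * 10 ** 3)
```

The number of cubes is a volume estimate in grid units, which is why it stays close to the true volume even at a modest resolution.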
Studying objects enlarged r times with the grid constant taken as 1 is
equivalent to studying the object at its original size with resolution r
(or grid constant 1/r). The advantage of the former (preferred by Jordan and
Minkowski) is that calculations of surface areas are integer arithmetic
operations; it is the approach chosen in our implementation illustrated later.
This equivalence is also called the general duality principle for multigrid studies.

The paper ^ proposed a way to classify polyhedrization schemes for surfaces 
of digitized regular solids based on the notion of balls of influence $B(p)$. Let $D_r(\Theta)$
be the digitized set of a regular solid $\Theta$ with a grid resolution $r$. For each polygon
$G$ of the constructed polyhedron there is a subset of $D_r(\Theta)$ such that only grid
points in the subset have influence on the specification of polygon $G$. This subset
is the ball of influence. Due to the finite number of calculated polygons, there is
a maximum value $R(r, \Theta)$ of all radii of the balls of influence. The polyhedrization
techniques are then classified based on the following criterion:

Definition 1. A polyhedrization technique is local if there exists a constant $k$
such that $R(r, \Theta) < k/r$, for any regular solid $\Theta$ and any grid resolution $r$. If a
polyhedrization method is not local then it is global.

For the surface detection algorithm of |5] , the marching cubes algorithm and 
marching tetrahedra algorithm, the constant k is not more than ^ HU. 

The soundness of a grid technique as pointed out by m should meet in short 
the following properties with respect to surface area estimation: 

1. higher resolutions should lead to convergence for the calculated value of 

surface area, and 

2. convergence should be towards the true value. 

Obviously, surface area calculation by counting the surface faces (grid faces of the
digital surface) of the digitized regular solid does not meet these criteria, since the
result may converge to a value between 1 and $\sqrt{3}$ times the true value,
depending on the shape and position of the given object, when the resolution r
goes to infinity m- Other local polyhedrization techniques as investigated in 0 
such as marching cube and dividing cube algorithms, although all converge, fail 
to converge to the true value. 
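The failure of plain face counting can be illustrated with a digitized ball. The sketch below counts the grid faces separating a cube of the digitization from a missing neighbor and divides by the true area 4*pi*R^2; the helper names and the radius are assumptions made for the illustration. For a ball, the count in each of the six axis directions approximates the area pi*R^2 of the projected disk, so the ratio is expected to stay near 3/2 rather than approach 1.

```python
import math


def digitized_ball(R):
    """Integer cube centers inside a ball of integer radius R (grid constant 1)."""
    rng = range(-R, R + 1)
    return {(i, j, k) for i in rng for j in rng for k in rng
            if i * i + j * j + k * k <= R * R}


def count_surface_faces(cubes):
    """Count unit faces between a cube in the set and a face-neighbor outside it."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return sum((i + di, j + dj, k + dk) not in cubes
               for (i, j, k) in cubes
               for (di, dj, dk) in steps)


R = 10
faces = count_surface_faces(digitized_ball(R))
ratio = faces / (4.0 * math.pi * R * R)  # stays well above 1
```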

On the other hand, a proposed global technique, relative-convex hull poly- 
hedrization HU which is implemented for convex sets by the calculation of the 
convex hull, converges to the true value when dealing with these convex sets 
P). Our method is also a global polyhedrization technique. In this method the 
given digital object is first polyhedrized by agglomerating surface faces into
maximum digital planar segments (DPSes). Then the surface area of the polyhedron
is calculated by summing the areas of those digital planar segments and is used
to estimate the surface area of the digitized object. Our global polyhedrization
technique is called the DPS method.

Section 2 gives our definition of a digital plane segment. Section 3 explains the 
algorithm to recognize a digital plane segment, and to perform the surface area 
calculation. The experimental results are presented and discussed in Section 4. 
Fig. 1. Definition of a DSS based on main diagonals; $l_1$ is the line containing the
maximum length tangential line segment of the 4-curve, $l_2$ is the second tangential line
parallel to $l_1$, and $a < \sqrt{2}$ is the main diagonal distance between both. Vector $\mathbf{n}$ is the
normal of $l_1$, and $\mathbf{v}$ is the vector of length $\sqrt{2}$ of the main diagonal



We conclude in Section 5. In this paper, we restrict our interest to
implementation and experimental results. The theoretical convergence analysis is
left to another paper.



2 Digital Planar Segments 

We generalize from a definition of a digital straight segment (DSS) in the plane 
as introduced in arithmetic geometry mm . Two parallel lines having a given 
4-path in between are called bounding lines. There are two possible diagonals in 
2D grid space, see Fig. 1. The main diagonal is the one which maximizes the
dot product with the normal of these lines. The main diagonal distance between
these parallel lines is measured in the direction of the main diagonal. A 4-path is
a DSS iff there is a pair of bounding lines having a main diagonal distance less
than $\sqrt{2}$.

We will define a digital planar segment (DPS) in 3D space by a main diagonal 
distance between two parallel planes. We distinguish them by defining one to be 
a supporting plane and the other as tangential plane. For a finite set of surface 
faces we consider a slice defined by these two planes. The supporting plane is 
defined by at least three non-collinear vertices of this face set, and all faces of 
this set are in only one of the (closed) halfspaces defined by this plane. Note 
that any non-empty finite set of faces has at least one supporting plane. Any 
supporting plane defines a tangential plane which is touching the vertices of this 
set ‘from the other side’. Figure 2 gives a rough illustration of a DPS. Again, $\mathbf{n}$
denotes the normal, and $\mathbf{v}$ is the vector of length $\sqrt{3}$ in the direction of the main
diagonal.

A grid cube has eight directed diagonals. The main diagonal of such a pair of
planes is that diagonal direction out of these eight directed diagonals which
has the maximum dot product (inner product) with the normal of the planes.
Note that there may be more than one main diagonal direction for the planes,
and we can choose any of these as our main diagonal. The distance between the
supporting plane and the tangential plane in the main diagonal direction is
called the main diagonal distance.
Fig. 2. A 3D view of a DPS: the main diagonal distance between the two bounding
planes is less than $\sqrt{3}$



Definition 2. A finite set of grid cubes in 3D space, being edge-connected, is a
DPS iff there is a pair of parallel planes (a supporting plane and a tangential
plane) containing this set in-between, and having a main diagonal distance less
than $\sqrt{3}$. A supporting plane of a DPS is called an effective supporting plane.

Therefore the main diagonal distance between an effective supporting plane and
its corresponding tangential plane is less than $\sqrt{3}$. A DPS may have more than
one effective supporting plane.

Let $\mathbf{v}$ be a vector in a main diagonal direction with length $\sqrt{3}$, $\mathbf{n}$ be the
normal vector of the pair of planes and $d = \mathbf{n} \cdot \mathbf{p}$ be the equation for one of these
two planes. According to Def. 2 all the vertices $\mathbf{p}$ of the grid cubes of a DPS
must satisfy the following inequality,

$$0 \le \mathbf{n} \cdot \mathbf{p} - d < \mathbf{n} \cdot \mathbf{v} \qquad (1)$$

Let $\mathbf{n} = (a, b, c)$. Since $\mathbf{v}$ is a vector of which all three elements have an
absolute value of 1 and the angle between $\mathbf{n}$ and $\mathbf{v}$ is always less than or equal to
90°, Equ. (1) becomes

$$0 \le ax + by + cz - d < |a| + |b| + |c| \qquad (2)$$

This equation is equivalent to that of the standard plane defined originally
by Reveilles H3- Let $w = |a| + |b| + |c|$, where $w$ is called the thickness of a
discrete plane. This thickness guarantees that the DPS remains tunnel free and 
without simple points P . This is the optimal arithmetical thickness for discrete 
planes. It follows that the normals of the effective supporting planes of a DPS 
are uniquely defined. Given the boundary of a DPS and one of the effective 
supporting planes, it is possible to reconstruct the original digital plane. Further 
digital plane models and algorithms may be found in mHEni. 
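Inequality (2) translates into a direct membership test. The sketch below assumes integer vertex coordinates and a given candidate plane (n, d); the function names are hypothetical, and the left bound is taken as non-strict because the supporting plane touches vertices of the set.

```python
def main_diagonal(n):
    """Return a main diagonal: entries +-1 chosen to maximize the dot product with n."""
    return tuple(1 if c >= 0 else -1 for c in n)


def satisfies_dps(points, n, d):
    """Check 0 <= a*x + b*y + c*z - d < |a| + |b| + |c| for every vertex."""
    a, b, c = n
    w = abs(a) + abs(b) + abs(c)  # thickness; equals n . v for a main diagonal v
    return all(0 <= a * x + b * y + c * z - d < w for (x, y, z) in points)


n = (1, 2, 3)
v = main_diagonal(n)
thickness = sum(nc * vc for nc, vc in zip(n, v))  # 6 for this normal

flat_patch = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
too_thick = flat_patch + [(1, 1, 1)]  # the extra vertex lies a full diagonal away
```

Here `satisfies_dps(flat_patch, n, 0)` holds, while adding the vertex (1, 1, 1) reaches the thickness bound and fails the strict inequality.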
3 The Algorithm

The digital surface of a regular solid consists of grid faces, and these may be 
traced using the surface tracking algorithm [2] to build a surface graph. Each
node of the surface graph contains a grid face as well as pointers to its four
edge-adjacent faces. With this representation, we can implement a breadth-first
search of all the faces to agglomerate faces into segments such that all faces
in one segment belong to one DPS. We are interested in calculating maximum
sets of faces satisfying the DPS definition.

The approach to agglomerate faces according to Equ. (2) is a classical problem
which is also known as recognition of a digital plane in computer imagery. Of
course, the definition of a digital plane may vary. The problem can be represented
as follows: given $n$ points $\{\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_n\}$, does there exist a DPS such that each
point satisfies the inequality Equ. (2), i.e. is it possible to solve the following
inequality system:

$$0 \le \mathbf{n} \cdot \mathbf{p}_i - d < \mathbf{n} \cdot \mathbf{v}, \qquad i = 1, \ldots, n \qquad (3)$$

This inequality system contains four scalar unknowns, $d$ and $\mathbf{n} = (a, b, c)$.
Since $\mathbf{n}$ has to be a normal vector in this case (i.e. of unit length), Equ. (3) is
not just a linear inequality system and hence it is not trivial to find out
whether a feasible solution exists or not. However, by eliminating $d$, we get a
new inequality system Equ. (4) which has $n^2$ inequalities:

$$\mathbf{n} \cdot \mathbf{p}_i - \mathbf{n} \cdot \mathbf{p}_j < \mathbf{n} \cdot \mathbf{v}, \qquad i, j = 1, \ldots, n \qquad (4)$$

This inequality system no longer requires $\mathbf{n}$ to be a normal vector, thus it
becomes a linear homogeneous inequality system with three unknowns a, b, c
and it can be solved in various ways. For example, the paper 0 presents a Fourier
elimination approach. Computationally this turned out not to be time-efficient.
More efficient algorithms include linear programming from operations research,
which finds an optimal solution given an objective function and a set of linear
constraints. If such a solution exists, the inequality system is solved. Quick Hull
pm is another method, which constructs a convex hull by cutting the 3D space
given a finite set of half-spaces. By finding out whether a non-empty convex hull
exists, the inequality system is then solved. The CDD |2H (C implementation
of the Double Description Method of Motzkin) algorithm generates all vertices, i.e.
extreme points and extreme rays of a general convex polyhedron in $\mathbb{R}^d$ given
by a system of linear inequalities. Our experiments show that CDD works
fastest in our case. However, when increasing the grid resolution, the number
of inequalities becomes very large, and solving the inequality system over and
over proved to be very slow and not sufficient for our multigrid convergence study.
It is critical to have an elegant and efficient algorithm to solve the inequality system.
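The effect of eliminating d can be seen in a small check for one fixed candidate normal. System (4) holds for all pairs exactly when the spread of the projections is below the thickness, and a valid offset is then the smallest projection. The sketch below only tests a given normal and does not search for one; its name and inputs are illustrative assumptions.

```python
def feasible_for_normal(points, n):
    """Test system (4) for a fixed candidate normal n = (a, b, c).

    Returns (ok, d): ok is True when max(n.p) - min(n.p) < |a| + |b| + |c|,
    and d = min(n.p) is then an offset that satisfies system (3).
    """
    a, b, c = n
    w = abs(a) + abs(b) + abs(c)
    projections = [a * x + b * y + c * z for (x, y, z) in points]
    d = min(projections)
    return max(projections) - d < w, d


step = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (2, 0, 1)]

ok_axis, _ = feasible_for_normal(step, (0, 0, 1))   # spread 1, thickness 1
ok_tilt, d_tilt = feasible_for_normal(step, (-1, 0, 2))  # spread 1, thickness 3
```

The axis-aligned normal is rejected because the spread equals the thickness, whereas a tilted normal accepts the same four vertices.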

Using the idea of effective supporting planes, an incremental algorithm can
be achieved by keeping a list of the effective supporting planes. Given a DPS
(whose effective supporting planes are known and kept in a list) and a new
vertex to test, we first need to find out whether the existing effective supporting
planes are still effective with respect to the new vertex. Those that are not are
deleted from the list. New supporting planes are then constructed with the new
vertex, and those that are effective are added to the list. If the list is empty
after all this, the vertex cannot be added to the current DPS. Constructing new
supporting planes with a new vertex is done by incrementally constructing the
convex hull of a DPS (El).

We start the DPS recognition process from a face graph which is obtained by
applying the surface tracking algorithm introduced in [2]. After the face graph
is constructed, we start a breadth-first search in the face graph to agglomerate
the faces into DPSes. This breadth-first search algorithm is implemented using
two queues. One is called the seeds queue, containing all searched faces not yet
belonging to any recognized DPS. Whenever a face cannot be added to the
current DPS, it is enqueued to the seeds queue. The next DPS will start from
one face chosen from the seeds queue. The second queue is used to maintain the
breadth-first search so that the growing of the DPS looks like the propagation
of a circular wave. Searching in this way, the shape of the DPS is expected to be
closer to a circle. Applying a different search strategy will result in a different
DPS segmentation. For example, depth-first search will generate DPSes with narrow
shapes. We will show the impact of these factors on our surface area measurement
in the experimental results section.

When an adjacent face is found, we try to add it to the current DPS. Only
when all four vertices of a face belong to the current DPS (vertices already
on the DPS need not be tested) can it be added to the current DPS and deleted
from the seeds queue if it is also there. Otherwise we enqueue this face to the
seeds queue and try another adjacent face. If no further adjacent face can be
added, we start a new DPS from a face in the seeds queue.
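The two-queue search described above can be sketched as follows. The face graph is assumed to be a connected adjacency dictionary, and `can_add` is a hypothetical stand-in for the incremental DPS test (any predicate on the current segment and a candidate face). Instead of deleting a face from the seeds queue when it gets absorbed, this sketch simply skips already labeled seeds.

```python
from collections import deque


def segment_faces(neighbors, can_add, start):
    """Greedy breadth-first segmentation of a face graph with two queues.

    neighbors: dict mapping a face to its edge-adjacent faces.
    can_add(segment, face): hypothetical acceptance test for the current segment.
    """
    labeled = {}              # face -> index of its segment
    segments = []
    seeds = deque([start])    # faces waiting to start a new segment
    while seeds:
        seed = seeds.popleft()
        if seed in labeled:   # absorbed by an earlier segment in the meantime
            continue
        segment = [seed]
        labeled[seed] = len(segments)
        wave = deque([seed])  # breadth-first frontier of the growing segment
        while wave:
            face = wave.popleft()
            for nb in neighbors[face]:
                if nb in labeled:
                    continue
                if can_add(segment, nb):
                    segment.append(nb)
                    labeled[nb] = len(segments)
                    wave.append(nb)
                else:
                    seeds.append(nb)  # rejected now, may seed a later segment
        segments.append(segment)
    return segments


# Toy example: a strip of six faces, segments limited to three faces each.
strip = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
result = segment_faces(strip, lambda seg, face: len(seg) < 3, start=0)
```

Replacing the `wave` queue by a stack turns the growth into a depth-first search, which is the variant compared in the experiments.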

There are a few considerations that further speed up the process. For example,
if the surface faces in a set all have the same direction, these faces must belong
to one DPS. If all the surface faces have one direction and the next face has a
different direction, this next face must belong to the same DPS too. At most
three directions of surface faces are allowed in one DPS.

The algorithm is summarized in Fig. 3. The computational complexity of
our surface tracking algorithm is O(n^). Incrementally compute the convex hull 
costs O(n^) too. They are outside each other’s loop. Therefore the overall cost 
of our DPS recognition algorithm is O(n^). 

The constructed polyhedron is composed of DPSes, which are in turn com- 
posed of faces. Those faces are not coplanar in the sense of Euclidean geometry. 
To evaluate the surface area of a DPS, we must first project the surface faces on 
one of the bounding planes of the DPS and then sum up the projected area. The 
surface area of the original regular solid is then estimated by the sum of areas 
of all DPSes. 
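The projection step can be sketched with a standard fact: a planar patch with unit normal f, projected orthogonally onto a plane with unit normal n, has its area scaled by |f . n|. The sketch below applies this to unit grid faces; the function name and the representation of faces by their axis-aligned normals are assumptions made for the illustration.

```python
import math


def projected_area(face_normals, n):
    """Total area of unit grid faces after orthogonal projection onto a plane.

    face_normals: one axis-aligned unit normal per grid face, e.g. (0, 0, 1).
    n: normal of the bounding plane (need not be of unit length).
    """
    length = math.sqrt(sum(c * c for c in n))
    total = 0.0
    for f in face_normals:
        total += abs(sum(fc * nc for fc, nc in zip(f, n))) / length
    return total


flat = projected_area([(0, 0, 1)] * 4, (0, 0, 2))            # four coplanar faces
stair = projected_area([(1, 0, 0), (0, 1, 0)], (1, 1, 0))    # one staircase step
```

Four faces lying in the plane keep their total area of 4, while one step of a 45 degree staircase (two faces) projects to the diagonal length sqrt(2) instead of being counted as 2.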
Fig. 3. Steps involved in our incremental DPS recognition algorithm











Fig. 4. A sphere and a 20*16*12 ellipsoid digitized at r = 40 and polyhedrized by 
DPSes with a face search depth limited to 7. 



4 Experimental Results 

The calculation of the surface area of an ellipsoid with all three semi-axes a, b, c
being allowed to be different is known to be a complicated task. If two radii
coincide, i.e. in the case of an ellipsoid of revolution, the surface area can be
analytically specified in terms of standard functions. The surface area formula in the general
case may be based on elliptic integrals. Example 2 in m specifies an analytical 
method to estimate the surface of such ellipsoids. The value calculated by this 
method for surface area is used in this paper as ‘ground truth’ to evaluate the 
performance of our DPS algorithm. 
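For the special case of an ellipsoid of revolution, the closed forms in terms of standard functions can be sketched as below. These are the textbook formulas for oblate and prolate spheroids, not the method of the cited example for general ellipsoids; the function name is hypothetical.

```python
import math


def spheroid_area(a, c):
    """Surface area of an ellipsoid of revolution with semi-axes (a, a, c)."""
    if math.isclose(a, c):
        return 4.0 * math.pi * a * a                      # sphere
    if c < a:                                             # oblate spheroid
        e = math.sqrt(1.0 - (c * c) / (a * a))
        return (2.0 * math.pi * a * a
                + math.pi * c * c / e * math.log((1.0 + e) / (1.0 - e)))
    e = math.sqrt(1.0 - (a * a) / (c * c))                # prolate spheroid
    return 2.0 * math.pi * a * a * (1.0 + c / (a * e) * math.asin(e))


sphere = spheroid_area(1.0, 1.0)
nearly_oblate = spheroid_area(1.0, 0.999)
nearly_prolate = spheroid_area(1.0, 1.001)
```

Both branches approach the sphere value 4*pi as the two distinct radii approach each other, which is a convenient consistency check before using such values as ground truth.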

Fig. 4 shows a digitized sphere and ellipsoid where faces are grouped into
DPSes with the breadth-first search strategy. The search depth is restricted to 
7. The total numbers of faces of digitized sphere and ellipsoid are 7584 and 4744, 
respectively, at resolution r=80. The number of DPSes of the sphere and ellipsoid 
are 247 and 160, respectively. The average size of one DPS is approximately 30 
faces in both cases. 
Fig. 5. Surface area estimation by breadth-first search

Fig. 6. Surface area estimation by depth-first search



With a restriction on search depth, the sizes of the DPSes are more
evenly distributed, and the DPSes have more regular shapes. Histograms for
various search depth values can be found in m- More evenly distributed sizes
and more regular shapes of the DPSes lead to more accurate surface area
estimations for general ellipsoids. Figure 5 illustrates surface area measurement
for a 20 by 16 by 12 ellipsoid in standard position (i.e. its axes are parallel to
the coordinate axes). It shows that the algorithm performs best when the depth
limitation is chosen appropriately. This is because if the search depth is limited
to a small value such as 5, the global technique is transformed into a local one.

Results using a depth-first search strategy are shown in Fig. 6. Since the
shapes of the DPSes are more irregular in this case, without search depth
limitation, the result is less accurate than for the breadth-first search. When a
search depth limit applies, the accuracy is closer to that of the breadth-first search.

Both search strategies of the DPS technique show a convergence tendency to 
the true value, but not homogeneously. This is due to the fact that with different 
resolution and the choice of initial faces for each DPS, the resulting polyhedron 
may differ. 

The marching cubes and the convex hull techniques are compared with our
DPS algorithm, and the results are shown in Fig. 7. Compared to the convex hull
or marching cubes algorithm at the same resolutions, the accuracy of our DPS
method is far better than that of the other two. The convex hull method and
marching cubes have a relative error of 3.2% and 8.7% respectively at r = 100,
while that of the DPS method is only 0.67%. Note that this result is somewhat
dependent on the initial face and search strategy.

Compared to algorithms based on solving the inequality system, our
incremental algorithm dramatically improves the computational efficiency. Figure 8
shows the computational time for polyhedrizing a 20 * 16 * 12 ellipsoid with
resolutions from 10 to 100 and with a search depth restriction of 7. The comparison
was done on a Pentium II 350 running Linux.

Fig. 7. Surface area estimation by marching cubes, convex hull and DPS.

Fig. 8. Computational time for Fourier elimination, CDD and DPS algorithm.

Fig. 9. Cut through a non-convex solid

Fig. 10. Surface area estimation of a non-convex solid

The convex hull method cannot be applied when non-convex solids have to be
considered. Hence we also test our algorithm using non-convex regular solids
defined by an ellipsoid having an ‘inner tangential ellipsoidal hole’.
Half of such a solid is visualized in Fig. 9. Results for three positions of such a
solid are shown in Fig. 10.
5 Conclusion

We designed and implemented a global polyhedrization algorithm and estimated 
the surface area of regular solids. Compared to other polyhedrization techniques 
such as the marching cubes and convex hull technique, our DPS method con- 
verges fast and is converging towards the true value. Our algorithm applies to 
non-convex objects as well. Although the relative error is already less than 0.5% 
with resolution 200, we can see that the convergence speed slows down and we 
can hardly tell whether it does finally converge to the true value. This is a general 
limitation of experimental studies. Note that the intersection of the Euclidean
projections of the DPSes does not form a complete polyhedron in general; this
might be a reason for a possible (not verified!) difficulty in ensuring absolutely
perfect convergence to the true value. Further investigations are needed.
Abstract. A method for deformable shape-based image segmentation is
described. Regions in an image are merged together and/or split apart, 
based on their agreement with an a priori distribution on the global 
deformation parameters for a shape template. Perceptually-motivated 
criteria are used to determine where and how to split regions, based 
on the local shape properties of the region group’s bounding contour. 
In general, finding the globally optimal region partition for an image is
an NP-hard problem; therefore, two approximation strategies are
employed: the highest confidence first algorithm and shape indexing trees.
Experiments show that the speedup obtained through use of the approx- 
imation strategies is significant, while accuracy of segmentation remains 
high. Once trained, the system autonomously segments shapes from the 
background, while not merging them with adjacent objects or shadows. 



1 Introduction 

Segmentation using traditional low-level image processing methods, such as re- 
gion growing or region split/merge, requires a considerable amount of interac- 
tive guidance in order to get satisfactory results. One solution is to exploit prior 
knowledge to sufficiently constrain the segmentation problem. For instance, a 
model based segmentation scheme can be used in concert with image prepro- 
cessing to guide and constrain region grouping PQEO]. However, due to shape 
deformation and variation within object classes, a simple rigid model-based ap- 
proach will break down in general. This realization has led to the use of de- 
formable shape models in image segmentation |HI!/4E17l4!/j . 

Unfortunately, the above mentioned techniques are going to make mistakes in 
merging regions, even in constrained contexts. This is because local constraints 
are in general insufficient; to gain a more reliable segmentation, global consis- 
tency must be enforced. This idea is embodied in the principle of global coherence 
pg|: the best partitioning is the one that globally and consistently explains the 
greatest portion of the sensed data [201 • Unfortunately, finding the globally opti- 
mal image partition is an NP hard problem; therefore, approximation strategies 
were proposed to achieve a practical system; e.g., simulated annealing high- 
est confidence first (HCF) jj|, or agglomerative clustering methods |S|. 

In this paper, a method for deformable shape-based image segmentation is 
described. Regions in an image are merged together and/or split apart, based 
on their agreement with an a priori distribution on the global deformation
parameters for a particular shape class. The likelihood of a region merge or split is
evaluated using a cost measure that includes region compatibility, region/model 
area overlap, and a deformation likelihood term. Perceptually-motivated criteria 
are used to determine where/how to split regions, based on the local shape prop- 
erties of the region group’s bounding contour. An initial version of our system 
used HCF in region merging. This paper describes the basic segmentation
formulation, plus two substantial improvements: region splitting and shape in- 
dexing trees. Experimental results show that these extensions improve both the 
speed and accuracy of shape-based segmentation. Once trained, the system au- 
tonomously segments deformed shapes from the background, while not merging 
them with adjacent objects or shadows. 



2 Related Work 



Previous work in deformable shape segmentation stems from the active contours 
formulation [811 2r/!2|‘/!4l37] . The active contours formulation can be extended to 
include a term that enforces homogeneous properties over the region contained 
within the contour during region growing [6I20[23I(!T^ . This hybrid approach 
offers the advantages of both region-based and deformable modeling techniques, 
and tends to be more robust with respect to model initialization and noisy data. 
However, it requires hand-placement of the initial model, or a user-specified seed 
point on the region’s interior. 

Other approaches use special-purpose deformable templates [22, 27, 41]. The
inclusion of object-specific knowledge in the model is used to further constrain 
segmentation, resulting in enhanced robustness to occlusion and noise. Further- 
more, the recovered template parameters can be used for shape recognition. 
These methods require the careful construction and parameterization of templates.
Deformable templates can be derived semi-automatically, via statistical
analysis of shape training data. The estimated probability density
function (PDF) for the shape deformation parameters can be used in Bayesian 
segmentation methods. 

From another view, image segmentation is a labeling problem; the ideal seg- 
mentation should be consistent or nearest to the one with maximum likelihood. 
This has led to various relaxation labeling or stochastic labeling methods that 
are related to general optimization algorithms. These techniques require
prior information, such as the number of labels needed and the probability
distribution of these labels in the image. However, such prior information is not 
always available or is inaccurate for general imagery. One common solution is to 
use the Minimum Description Length (MDL) principle in segmentation.
MDL has a strong fundamental grounding, being based on information-theoretic
arguments: the simplest model explaining the observations is the best,
and it can result in an objective function with no arbitrary thresholds. 

After defining the criterion function for labeling, the next problem is computing
the solution to the optimization problem. Generally speaking, finding
the globally consistent image labeling is an NP-hard problem; therefore,
approximation algorithms are needed to solve any segmentation problem of realistic
size. Simulated annealing methods are frequently used. In simulated
annealing, the choice of the temperature sequence involves a tradeoff
between the convergence speed and the correctness of the result. In general, the 
temperature must be lowered at a very slow (logarithmic) rate; however, it has 
been shown that the temperature can be lowered more rapidly (e.g., exponentially)
if moves are selected from a size distribution proportional to the Cauchy
distribution. Despite these insights, the simulated annealing approach to
segmentation remains prohibitively slow. 
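
The gap between the two kinds of temperature sequence can be made concrete with a small sketch. The function names, the starting temperature, and the decay rate 0.95 are illustrative assumptions, not values from the paper; the code only counts how many steps each schedule needs to reach a target temperature.

```python
import math
from typing import Callable


def log_schedule(t0: float, k: int) -> float:
    """Logarithmic cooling: T_k = t0 / ln(k + e), so that T_0 = t0."""
    return t0 / math.log(k + math.e)


def exp_schedule(t0: float, k: int, rate: float = 0.95) -> float:
    """Exponential (geometric) cooling: T_k = t0 * rate**k (rate is an assumed value)."""
    return t0 * rate ** k


def steps_to_reach(schedule: Callable[[float, int], float],
                   t0: float, target: float, max_steps: int = 10**7) -> int:
    """Smallest k with schedule(t0, k) <= target, or -1 if not reached in max_steps."""
    for k in range(max_steps):
        if schedule(t0, k) <= target:
            return k
    return -1


if __name__ == "__main__":
    print("exponential:", steps_to_reach(exp_schedule, 1.0, 0.1))
    print("logarithmic:", steps_to_reach(log_schedule, 1.0, 0.1))
```

Under these assumed settings the logarithmic schedule needs several orders of magnitude more steps to cool by a factor of ten, which is the practical reason the faster schedules are attractive.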

Chou and Brown used highest confidence first (HCF) to infer a unique
labeling from the a posteriori distribution that is consistent with both the prior
knowledge and evidence. In the HCF algorithm, the computational complexity is
generally less than that needed to obtain segmentation results of similar quality via
the simulated annealing algorithm. In HCF, the number of different merging
configurations tested is O(n^), where n is the number of regions in the initial
over-segmentation of the image.

The above mentioned work leads to the development of our approach. De- 
formable shape templates are used to partition the image into a globally con- 
sistent interpretation, determined in part by the MDL principle. The highest 
confidence first algorithm will be used to obtain an approximation to the glob- 
ally consistent labeling of the image. As will be described in the next section, two 
extensions of this basic approach are needed to make segmentation efficient and 
accurate: 1.) shape indexing trees for faster model fitting, and 2.) perceptually 
motivated region splitting methods for (generally) more accurate segmentation. 

3 Basic Approach 

In [26] we proposed a method for region merging that uses a deformable model
to guide grouping of image regions. We review it briefly here.
As shown in Fig. 1, the deformable model-based segmentation system includes a
pre-processing (over-segmentation and edge detection) stage, and a model-based 
region grouping stage. 

In the pre-processing stage, the input image is over-segmented via standard 
region merging algorithms. An example input image and the resulting
over-segmented output are shown in Fig. 1. The output of this module includes
a standard region adjacency graph. An edge map is also computed; notable edges 
and their strengths are detected via standard image processing methods. The 
resulting edge map will be used to constrain consideration of possible grouping 
hypotheses later in region merging. 

The system then tests various combinations of candidate region groupings to
obtain an optimal labeling of the image. The shape model is deformed to match
each grouping hypothesis $g_i$ in such a way as to minimize a cost function:

$E(g_i) = \alpha E_{color} + (1 - \alpha)\bigl((1 - \beta) E_{area} + \beta E_{deform}\bigr)$ ,   (1)

Fig. 1. Basic system overview. The color image (image of bananas) undergoes pre-
processing, which results in an over-segmentation and an edge map. These are inputs
to the model-based region grouping stage (using a banana template). The final output
includes region groupings for detected objects (four bananas), and recovered models for
the objects.

where $\alpha$ and $\beta$ are scalar constants with values in the range [0, 1] that control
the relative importance of the three terms: $E_{color}$ is a region color compatibility
term for the region grouping, $E_{area}$ is a region/model area overlap term, and
$E_{deform}$ is a deformation energy for the shape model. A model fitting procedure
is used to compute the cost in Eq. 1 via the downhill simplex method.

The deformation term enforces a priori constraints on the amounts and types 
of deformations allowed for the template; i.e.: 

$E_{deform} \propto -\log P(a \mid \Omega)$ ,   (2)

where $P(a \mid \Omega)$ gives the prior distribution on global deformation parameters, $a$,
for a particular shape class $\Omega$. In our experience, the use of a Gaussian model
for the prior distribution on global deformation leads to reliable shape-based 
image segmentation. An estimate of the prior distribution is computed in a 
supervised fashion, for a given set of training examples for that shape class. In our 
implementation, linear and quadratic polynomials are used to model deformation 
due to stretching, shearing, bending, and tapering. 
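
As a worked illustration of the Gaussian case mentioned above: the negative log of a multivariate normal density is a quadratic form, so an energy proportional to the negative log prior reduces to a Mahalanobis-type distance. The symbols $\mu$ (mean deformation), $\Sigma$ (covariance), and $d$ (number of deformation parameters) are assumed names introduced here, not notation from the paper; $\mu$ and $\Sigma$ would be estimated from the training examples.

```latex
\begin{align*}
P(a \mid \Omega) &= \frac{1}{\sqrt{(2\pi)^{d}\det\Sigma}}
   \exp\!\Bigl(-\tfrac{1}{2}(a-\mu)^{\top}\Sigma^{-1}(a-\mu)\Bigr), \\
-\log P(a \mid \Omega) &= \tfrac{1}{2}(a-\mu)^{\top}\Sigma^{-1}(a-\mu)
   + \tfrac{1}{2}\log\bigl((2\pi)^{d}\det\Sigma\bigr).
\end{align*}
```

The second term does not depend on $a$, so under this assumption minimizing the deformation energy amounts to minimizing the quadratic form alone.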

Further, to test the quality of a possible partitioning, a global cost function
for partitioning the whole image is defined:

$\mathcal{E} = (1 - \gamma) \sum_{i=1}^{n} r_i E(g_i) + \gamma n$ ,   (3)

where $\gamma$ is a scalar constant, $n$ is the number of groupings in the current
image partitioning, $r_i$ is the ratio of the area of group $g_i$ to the total area, and $E(g_i)$
is the cost function for group $g_i$ (Eq. 1). The highest confidence first (HCF)
algorithm is used to find an approximately optimal value for Eq. 3. Region
groupings obtained via HCF and recovered shape models are shown in Fig. 1.



4 Index Trees 

One problem with the system proposed in Section 3 is that segmentation can be
slow for images of moderate complexity. This is because the shape model fitting 
procedure must be invoked many times in order to get the cost values of different 
configurations. Although we utilize methods to speed up the fitting procedure, 
such as multi-resolution fitting, and caching deformation parameters, most of 
the CPU time (over 90%) is still used in model fitting. We therefore propose to 
use an index tree method to accelerate the model fitting procedure. 

The basic idea behind index trees can be explained as follows. We first gener- 
ate many deformed instances of the object class by sampling in the deformation 
space according to the prior distribution of the deformation parameters. We 
then compute a shape feature vector for each generated instance. In our im- 
plementation, the features employed are the seven normalized central moments. 
The shape feature vector and the deformation parameters are stored with the 
instance. Then, in the fitting process, we compute the shape feature vector for a 
potential region group. By comparing the shape feature vectors, the most similar
one for the region group is fetched from the set of generated instances (called an 
instance set). Its associated deformation parameters are used as the parameters 
for the region group, or as a starting point to invoke a refining process. 
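
The lookup described above can be sketched as follows. This is a minimal illustration, assuming that the seven features are the normalized central moments of orders two and three of a binary region mask (one plausible reading of "seven normalized central moments") and that similarity is plain Euclidean distance; the function names and the flat linear search are assumptions made for clarity.

```python
from typing import Any, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]

# The seven (p, q) orders with p + q in {2, 3}.
ORDERS = [(2, 0), (1, 1), (0, 2), (3, 0), (2, 1), (1, 2), (0, 3)]


def normalized_central_moments(mask: Mask) -> List[float]:
    """Feature vector eta_pq = mu_pq / mu_00**(1 + (p+q)/2) for a binary mask."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = float(len(pts))
    if m00 == 0:
        raise ValueError("empty region")
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00
    feats = []
    for p, q in ORDERS:
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        feats.append(mu / m00 ** (1 + (p + q) / 2))
    return feats


def nearest_instance(query: Sequence[float],
                     instances: Sequence[Tuple[Sequence[float], Any]]) -> Tuple[Sequence[float], Any]:
    """Return the stored (features, deformation_params) pair closest to `query`."""
    def dist2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(instances, key=lambda inst: dist2(query, inst[0]))
```

These moments are invariant to translation and approximately to scale, so a region group can be matched to a stored instance regardless of where it sits in the image; the retrieved parameters then serve as the fit or as its starting point.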

To speed up search, we organize the instances in a tree such that the retrieval
time can be logarithmic in the number of instances. We use a hierarchical
clustering method (minimum variance) to process the shape features of the instances,
and get the tree structure. In our experiments, we have used the cophenetic
correlation coefficient (CPCC) to validate clustering. Although we uniformly
sample in the deformation space, there is indeed a hierarchical structure in the 
corresponding shape feature space. 

By searching for the best match in the index tree, the searching time is 
reduced but it does not guarantee that the nearest match is always found. We 
tried to use the mean feature of instances in each non-leaf node to select a 
branch and go to the next level. However, the covariance and distribution for 
the instances in each node are not the same; furthermore, their distributions 
are not Gaussian in general. In order to overcome this problem, we use linear
discriminant functions [13] at the non-leaf nodes. Our experiments verified that
this method can increase the success rate in finding the nearest neighbors. 
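
A descent through such a tree might look like the sketch below. It assumes a binary tree (as produced by agglomerative clustering) in which every non-leaf node stores one linear discriminant `w . x + b`; the `Node` layout and the sign convention for choosing a branch are hypothetical, and training of the discriminants is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    # A leaf stores instance ids; an internal node stores a discriminant and two children.
    instances: Optional[List[int]] = None
    w: Optional[Sequence[float]] = None
    b: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def descend(root: Node, x: Sequence[float]) -> List[int]:
    """Follow the discriminants from the root to a leaf; return that leaf's instance ids."""
    node = root
    while node.instances is None:
        score = sum(wi * xi for wi, xi in zip(node.w, x)) + node.b
        node = node.left if score < 0 else node.right
    return node.instances
```

Each query costs one dot product per level, so retrieval time grows with the depth of the tree rather than with the number of stored instances.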

Another problem with index tree search is that the retrieved result is the 
nearest neighbor in the shape feature space. However, the distance metric in 
shape feature space is not the same as the fitting cost (Eq. 1), nor is it monotonic
in the fitting cost. A neural network (NN) can be used to map from the difference
in the shape feature space to the fitting cost measure. We use a three-layer
backpropagation network with bias terms and momentum. The NN is only used
for mapping in the leaf nodes of the index tree to reduce the on-line computation. 
In summary, a linear discriminant based on the shape feature vector is used in 
the first level (coarse level) of the index tree, and the NN mapping from shape 
feature difference to fitting cost is used in the second level (fine level) . 

The index tree approach was tested on over one hundred cluttered images of 
objects taken from a number of different shape classes. It was observed that the
CPU time needed for segmentation was decreased by one order of magnitude, 
while the number of errors in segmentation did not increase appreciably over 
HCF without index trees. 



5 Shape-Based Region Splitting 



The shape-based region merging may not yield accurate segmentation in all 
cases. A common problem is that in some images there is no distinguishable 
boundary between touching or overlapping objects; therefore, it is not possible 
to separate them in the over-segmentation stage. If we only use model-based 
merging operations in the second stage, then the correctness of the final result 
will depend on the correctness of the initial over-segmentation. Therefore, a 
model-based region splitting operation is needed. 

A natural question is: how does the human vision system parse regions into
parts? Perhaps such strategies can be adapted to solve our problem. Some
theorists postulate that there is a set of basic shape primitives [3, 4, 28, 31] that are
useful in finding parts and describing them. Others postulate that there are
rules, based on geometric properties alone, by which humans perceive
boundaries between parts for a given shape [18, 35]. In our system, we make use of both
shape constraints (our deformable template and statistical priors) and geometric 
properties of the region contour (curvature minima) in guiding splits. 

The first problem is determining if a region should be split at all. Candidate 
regions for splitting can be detected based on the model fitting cost value (Eq. 1)
and a specified threshold. The threshold can be obtained through statistical
analysis over an example set. If the threshold is lower than required, it does not
degrade the accuracy of segmentation; instead, more possible splits have to be 
tested, but in our experience the resulting segmentation is similar. 

Given a candidate region for splitting, possible cuts need to be determined 
and their feasibility tested. Following prior work, cut points can be chosen as concave
cusps and negative minima of curvature of the region's boundary. In human
perception, it has been found that there is a preference for certain cuts over
others. For instance, there is the short-cut rule: if boundary points
can be joined in more than one way to parse a silhouette, we prefer the parsing 
which uses the shortest cuts. Thus, a cut is defined to be (1) a straight line which 
(2) crosses an axis of local symmetry, (3) joins two points on the outline of a 
silhouette, such that (4) at least one of the two points has negative curvature. 
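
For a region whose boundary is approximated by a polygon, points of negative curvature can be found from the sign of a cross product. This is a minimal sketch, assuming a simple polygon listed in counter-clockwise order; the polygonal approximation and the function name are assumptions for illustration, and the symmetry-axis condition of the cut definition is not checked here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def concave_vertices(polygon: Sequence[Point]) -> List[int]:
    """Indices of vertices where a counter-clockwise polygon turns right (is concave)."""
    n = len(polygon)
    concave = []
    for i in range(n):
        px, py = polygon[i - 1]          # previous vertex (wraps around)
        cx, cy = polygon[i]              # current vertex
        nx, ny = polygon[(i + 1) % n]    # next vertex
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if cross < 0:
            concave.append(i)
    return concave


if __name__ == "__main__":
    l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
    print(concave_vertices(l_shape))
```

For the L-shaped example only the inner corner is reported, which is where a cut separating the two arms would have to end.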







To select the best cut, the direct strategy is to test all possible cuts, and
choose the split that yields the greatest decrease in the global cost (Eq. 3).
The drawback of this exhaustive search is its computational cost. If
the number of candidate cuts is small, then the strategy is suitable. In general,
however, we employ a modified short-cut rule: select the shortest cut; if this cut is
feasible, stop; otherwise, delete the candidate point with the least significance
from this cut, and repeat the test on the shortest cut among the remaining candidate
points, until a feasible cut is found or there is no cut left to test. The significance
of a candidate point can be defined as the distance between the point and its
corresponding point on the model boundary.
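
The modified short-cut rule can be sketched as a short loop. This is an illustration under simplifying assumptions: every pair of candidate points is treated as a possible cut, the feasibility test is passed in as a caller-supplied predicate (hypothetical; in the system it would involve refitting the model), and significance values are given rather than measured against a model boundary.

```python
import math
from itertools import combinations
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[float, float]
Cut = Tuple[Point, Point]


def modified_short_cut(candidates: Dict[Point, float],
                       is_feasible: Callable[[Cut], bool]) -> Optional[Cut]:
    """Return the first feasible cut found by the modified short-cut rule, or None.

    `candidates` maps each candidate cut point to its significance.
    """
    remaining = dict(candidates)
    while len(remaining) >= 2:
        # Shortest cut among all pairs of remaining candidate points.
        cut = min(combinations(remaining, 2), key=lambda c: math.dist(c[0], c[1]))
        if is_feasible(cut):
            return cut
        # Drop the less significant endpoint of the rejected cut and try again.
        weakest = min(cut, key=lambda p: remaining[p])
        del remaining[weakest]
    return None
```

Each rejected cut removes one candidate point, so the loop tests at most one cut per candidate instead of every pair, which is the saving over the exhaustive strategy.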



6 Experiments 



The system has been implemented and tested in a color image segmentation ap- 
plication that uses 2D shape models and global deformations. In the implemen- 
tation, the prior distribution on global deformations for each shape is assumed 
Gaussian, and estimated using region segmentations provided in a training set. 
The system was tested on cluttered images of objects taken from a number 
of different shape classes (e.g., fish, blood cells, leaves, fruit, and vegetables),
and results are encouraging. Example input images and segmentation results
obtained with the system are shown in Fig. 2.

To quantitatively evaluate performance of the system, we also employed synthesized
imagery. Fig. 2 shows an example taken from our database of synthetic
fish images. The shape deformation parameters for generating fish were ran- 
domly sampled based on the distribution information obtained from the fish 
model training stage. We also added 5% white noise to the synthesized images, 
and triangulation methods were used to assign different color values to the dif- 
ferent parts of the generated fish. As a result, in the over-segmentation results 
shown in the figure, every fish object breaks into many small regions. From the 
figure, we can see that there are some incorrect mergings during the HCF region 
merge stage, and that the boundaries of some fish are imprecise. The results are 
improved after model-based splitting and re-merging. 

To evaluate the improvement gained by using HCF split/merge over HCF
with merging only, we conducted experiments using a database of 22 leaf images
(210 leaf objects), 20 synthesized fish images (160 fish objects), and 21
blood cell micrographs (about 700 cell objects). The success rate of object detection
and mean fitting cost for objects before and after the splitting stage are
shown in Table 1. The mean fitting cost can be regarded as a measure of the
boundary accuracy in object detection. As is evident in the table, segmentation 
improves when model-based splitting is used, followed by a merge step. Com- 
bined split/merge yielded at least a 7% improvement in the object detection 
rate, as well as a uniform reduction in model fitting cost.

Fig. 2. Examples of segmentation obtained with images from the test database: (a)
original images, (b) initial over-segmentation (note some regions overlap more than one
object), (c) shape-based region merging result obtained via HCF, (d) result obtained
via split followed by another merge step, (e) recovered shape models.

Table 1. Statistical analysis of region splitting algorithm for the method based on
distance from model boundary and short-cut rule. See Sec. 6 for discussion.

Accuracy Before Region Splitting
                                   images of leaves   images of cells   images of fish
mean fitting cost                  1.1338             1.1863            1.0626
success rate of object detection   85.24%             85.79%            76.25%

Accuracy After Region Splitting
                                   images of leaves   images of cells   images of fish
mean fitting cost                  1.1118             1.0531            1.0528
success rate of object detection   92.86%             93.90%            85.00%


Abstract. In this paper we propose a series of novel morphological operators that
are anisotropic, and adapt themselves to the local orientation in the image. This 
new morphology is therefore rotation invariant; i.e. rotation of the image before or 
after the operation yields the same result. We present relevant properties required 
by morphology, as well as other properties shared with common morphological 
operators. Two of these new operators are increasing, idempotent and absorbing, 
which are required properties for a morphological operator to be used as a sieve. 

A sieve is a sequence of filters of increasing size parameter, that can be used to 
construct size distributions. As an example of the usefulness of these new operators, 
we show how a sieve can be built to estimate a particle or pore length distribution,
as well as the elongation of those features. 

1 Introduction 

When analyzing images without a preferred orientation, or images with an unknown 
orientation (as is the case, for example, of an image acquired after placing a sample 
randomly under a microscope), it is desirable to use rotation invariant operations. A 
rotation invariant operation yields an output that is independent of the orientation of the 
sample with respect to the sampling grid. There are three different ways of constructing 
rotation invariant operators: 

- using a single isotropic operator (the kernel itself is rotation invariant), 

- using a data-driven anisotropic operator (the kernel is anisotropic, but is oriented to 
the local gradient in the image), or 

- by combining a set of anisotropic operators. 

Non-rotationally invariant filters will almost certainly produce incorrect results if 
they are not aligned with the image under study, and an isotropic filter is often limited 
in its capabilities. Therefore, it is worthwhile to study rotation invariant operators based 
on anisotropic kernels. 

For example, consider an isotropic morphological closing, which has a disk as the 
structuring element (we regard 2D images for now). If we apply such a filter to an image 
with dark objects, such as the microscopical image in Fig. 1, all dark objects smaller
than the structuring element will be removed from the image. If we see the image as a
landscape where the dark features are the valleys and the light ones the hills, as in Fig. 1,
we can imagine the closing as filling up the valleys such that no valleys remain in which
the structuring element cannot fit (see Fig. 2).

This closing operation can be used as a sieve to detect features larger than a certain
size. The problem is that this size is only determined by the smallest diameter of the 
features. To measure length, an anisotropic structuring element is required. 

In this paper we will introduce new morphological operators based on isotropic 
structuring elements with a lower dimensionality than the image under study (and thus 
anisotropic in the space of the image). By dropping one or more dimensions, the struc- 
turing element gets some degrees of rotational freedom that allows it to align itself with 
the features in the image. By selecting the orientation that causes minimum or maximum 
response (pixel by pixel), we create a rotation-invariant operator. In the two-dimensional 
case, the structuring element would be one-dimensional, with one degree of rotational 
freedom. A closing in this new morphological framework would remove an object only 
if the line element could not fit. This would mean that its largest diameter (supposing 
convex objects) is smaller than the structuring element (see Fig. 3).

We will call this new morphological framework Rotation-Invariant Anisotropic (RIA) 
morphology. We can call it morphology because it satisfies the four principles of
morphology:

- Translation invariance, 

- Compatibility under change of scale, 

- Local knowledge, and 

- Semi-continuity. 

The first three principles are expressed as properties of the operators in Sect. 3 and
proven elsewhere. The principle of semi-continuity requires that the theory in the
continuous world has an approximate counterpart in the discrete world, and is what
makes this theory applicable in practice. Although the discretization of the operators
presented here is beyond the scope of this article, it certainly is possible to apply these 
operators to discrete images. 

In Sect. 4 we will apply the new closing and opening introduced here to do segmentation-free
measurements using morphological sieves. Sieves are used to build multi-
parameter (length, width, depth) size distributions that characterize the shapes of objects, 
structures or textures in grey-value images. These measured distributions can be used for 
image recognition or characterization, and are applicable in a wide variety of situations. 

2 Definitions 

In this paper we use the notation specified in Table 1. We will use Greek characters
(especially $\varphi$ and $\theta$) for rotation angles, and Latin characters (especially $x$ and $y$) for
translation vectors and image coordinates. $f$ and $g$ denote continuous functions $\mathbb{R}^n \to \mathbb{R}$
(the image being processed). Vectors are not distinguished typographically because it is
obvious from the context which variables are vectors and which ones are scalars.

2.1 Dilation 

A flat, isotropic structuring element $D$ of radius $r$ can be decomposed into (an infinite
number of) rotated line segments $L_\varphi$ of length $\ell = 2r$. The dilation then becomes,
with $\varphi \in [0, \pi)$,

Fig. 1. A portion of the image 'cermet', after some processing.

Fig. 2. The image from Fig. 1 after closing with a circular structuring element.

Fig. 3. Result of the new, rotation invariant, anisotropic morphological closing applied to the image
in Fig. 1. Compare with the result of the isotropic closing in Fig. 2.
Table 1. Notation used in this paper

  $f_\varphi$          rotation of $f$ over an angle $\varphi$
  $f_x$                translation of $f$ over $x$: $f_x(t) = f(t - x)$
  $f_{\varphi,x}$      rotation of $f$ over an angle $\varphi$ and then a translation over $x$
  $S_b f$              scaling of $f$ with a factor $b$: $S_b f(x) = f(x/b)$
  $\oplus$             Minkowski addition
  $\check{S}$          transpose of $S$: $\check{S}(x) = S(-x)$
  $\delta_S f$         dilation of $f$ with $S$, any flat structuring element:
                       $\delta_S f = f \oplus \check{S} = \bigvee_{x \in \check{S}} f_x$
  $\varepsilon_S f$    erosion of $f$ with structuring element $S$
  $\gamma_S f$         opening of $f$ with structuring element $S$
  $\phi_S f$           closing of $f$ with structuring element $S$
  $A \triangleq B$     definition: "Let A be defined as B"
  $A = B$              equality by definition: "A is equal to B by definition"


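$\delta_D f = f \oplus D = f \oplus \bigcup_\varphi L_\varphi = \bigvee_\varphi (f \oplus L_\varphi) = \bigvee_\varphi \bigvee_{x \in L_\varphi} f_x$ .   (1)

Note that we ignore the transpose operation since $\check{D} = D$ and $\check{L}_\varphi = L_\varphi$.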

Based on this, we define a new morphological operator, which we will call RIA
dilation, and denote with the symbol $\delta^+_L$,

$\delta^+_L f = \bigwedge_\varphi \bigvee_{x \in L_\varphi} f_x = \bigwedge_\varphi \delta_{L_\varphi} f$ .   (2)

This operator takes the maximum of the image over a line segment rotated in such
a way as to minimize this maximum. Figure 4 gives an example of the effect that the
operator has on an object boundary. Note that a convex object boundary is not changed, 
but a concave one is. 
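
A crude discrete version of this operator can be sketched in a few lines. The paper leaves discretization out of scope, so everything specific here is an assumption: the segment is sampled at integer offsets by rounding, only a small fixed set of orientations is tried (four by default), and samples falling outside the image are ignored.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def line_offsets(angle: float, half_len: int) -> List[Tuple[int, int]]:
    """Integer (dx, dy) offsets of a centred line segment at the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return sorted({(round(t * c), round(t * s)) for t in range(-half_len, half_len + 1)})


def ria_dilation(img: Image, half_len: int = 1, n_angles: int = 4) -> Image:
    """Per pixel: minimum over orientations of the maximum along the segment."""
    h, w = len(img), len(img[0])
    segments = [line_offsets(math.pi * k / n_angles, half_len) for k in range(n_angles)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            maxima = []
            for seg in segments:
                vals = [img[y + dy][x + dx] for dx, dy in seg
                        if 0 <= y + dy < h and 0 <= x + dx < w]
                maxima.append(max(vals))   # never empty: offset (0, 0) is always inside
            out[y][x] = min(maxima)
    return out
```

With these settings an isolated bright pixel is left unchanged, because some orientation always misses it from any neighbouring position, while the inner corner of an L-shaped bright region is filled in, since every orientation through that corner touches the shape. This mirrors the convex/concave behaviour noted above.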

We like to compare this dilation operator with a train running along a track. The 
train wagons (which are joined at both ends to the track) require some extra space at 
the inside of the curves. This dilation, applied to a train track, and using a structuring 
element with the length of the wagons, reproduces the area required by them. 

2.2 Erosion 

RIA erosion is defined as the dual of the RIA dilation, and will be denoted with the
symbol $\varepsilon^+_L$,

$\varepsilon^+_L f = -\delta^+_L(-f) = -\bigwedge_\varphi \bigvee_{x \in L_\varphi} (-f_x) = \bigvee_\varphi \bigwedge_{x \in L_\varphi} f_x = \bigvee_\varphi \varepsilon_{L_\varphi} f$ .   (3)

Fig. 4. Effect of the RIA dilation on an object boundary. a: The original boundary. b: The boundary
after the dilation, together with the line segment used as a structuring element. c: Construction of
the dilated object boundary.



2.3 Closing 

The closing is usually defined as a dilation followed by an erosion,

$\phi_D f = \varepsilon_D \delta_D f$ .   (4)

However, it is easier to understand (and modify) if we see it as the maximum of the
image over the support of the structuring element $D$, after shifting it in such a way that
it minimizes this maximum, but still hits the point $t$ at which the operation is being
evaluated (see Fig. 5). Or, in other words, the 'lowest' position we can give $D$ by
shifting it over the 'landscape' defined by the function $f$:

$\phi_D f = \bigwedge_{x \in D} \bigvee_{y \in D_x} f_y = \bigwedge_{x \in D} \Bigl( \bigvee_{y \in D} f_y \Bigr)_x = \varepsilon_D \delta_D f$ .   (5)

In accordance with this, we define a new morphological operation, RIA closing, as the
'lowest' position we can give the linear structuring element $L$, by shifting and rotating
it over the 'landscape' $f$, such that it still hits the point $x$ being evaluated (see Fig. 5).
It will be denoted by $\phi^+_L$, and defined by

$\phi^+_L f = \bigwedge_\varphi \bigwedge_{x \in L_\varphi} \bigvee_{y \in L_{\varphi,x}} f_y$ ,   (6)

which is analogous to the definition of the RIA dilation, where we also changed the disk
for a line, and added a minimum over the orientation of that line. As it turns out, this
is the same as the minimum of the closings, at all orientations, with a line segment as
structuring element,

$\phi^+_L f = \bigwedge_\varphi \bigwedge_{x \in L_\varphi} \bigvee_{y \in L_{\varphi,x}} f_y = \bigwedge_\varphi \varepsilon_{L_\varphi} \delta_{L_\varphi} f = \bigwedge_\varphi \phi_{L_\varphi} f$ ,   (7)

but not equal to a RIA dilation followed by a RIA erosion.

Fig. 5. a: The closing with an isotropic structuring element (disk) is determined by shifting the
disk in such a way that it still hits the point being evaluated, and minimizes the supremum of the
image over its support. b: The RIA closing is determined by shifting and rotating the line segment
in such a way that it still hits the point being evaluated, and minimizes the supremum of the image
over its support.

We will show elsewhere that this transformation is increasing, idempotent and
extensive, and therefore we can call it an algebraic closing. Moreover, Matheron has
shown that any intersection of morphological closings is an algebraic closing [5]. We
can interpret $\bigwedge_\varphi \phi_{L_\varphi} f$ as the intersection of an infinite series of closings, in which case
the increasingness, idempotence and extensivity are proven by Matheron. For previous
work using rotated line segments see Soille.
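
The characterization of the RIA closing as a minimum of ordinary line closings suggests a direct, if naive, discrete sketch. As before, the discretization is an assumption made for illustration: segments are sampled at rounded integer offsets, only four orientations are used by default, and samples outside the image are ignored in both the dilation and the erosion step.

```python
import math
from typing import Callable, Iterable, List, Tuple

Image = List[List[float]]
Segment = List[Tuple[int, int]]


def line_offsets(angle: float, half_len: int) -> Segment:
    """Integer (dx, dy) offsets of a centred, symmetric line segment."""
    c, s = math.cos(angle), math.sin(angle)
    return sorted({(round(t * c), round(t * s)) for t in range(-half_len, half_len + 1)})


def _apply(img: Image, seg: Segment, reduce_fn: Callable[[Iterable[float]], float]) -> Image:
    """Flat dilation (reduce_fn=max) or erosion (reduce_fn=min) with the segment."""
    h, w = len(img), len(img[0])
    return [[reduce_fn(img[y + dy][x + dx] for dx, dy in seg
                       if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(w)] for y in range(h)]


def line_closing(img: Image, seg: Segment) -> Image:
    """Ordinary closing with one line segment: dilation followed by erosion."""
    return _apply(_apply(img, seg, max), seg, min)


def ria_closing(img: Image, half_len: int = 1, n_angles: int = 4) -> Image:
    """Pixelwise minimum of the line closings over the sampled orientations."""
    closings = [line_closing(img, line_offsets(math.pi * k / n_angles, half_len))
                for k in range(n_angles)]
    h, w = len(img), len(img[0])
    return [[min(c[y][x] for c in closings) for x in range(w)] for y in range(h)]
```

A dark bar longer than the segment survives this closing, because the segment fits inside it at one orientation, whereas a single dark pixel is filled at every orientation and disappears; the operator therefore responds to the length of dark features rather than to their width.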



2.4 Opening 

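The RIA opening is defined as the dual of the RIA closing, and denoted by the symbol
$\gamma^+_L$,

$\gamma^+_L f = -\phi^+_L(-f) = -\bigwedge_\varphi \bigwedge_{x \in L_\varphi} \bigvee_{y \in L_{\varphi,x}} (-f_y) = \bigvee_\varphi \bigvee_{x \in L_\varphi} \bigwedge_{y \in L_{\varphi,x}} f_y = \bigvee_\varphi \gamma_{L_\varphi} f$ .   (8)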

2.5 Extension to Higher Dimensionalities 

Until now we have only talked about operations on two-dimensional images. However, 
it is very easy to extend the RIA morphology to higher dimensionalities. For example, 
in the 3D case, it would be possible to have structuring elements with either one or two 
dimensions (i.e. a disk or a line segment); both have two degrees of rotational freedom. A 
closing with these two structuring elements can be used to measure the first and second 
largest diameters of the (convex) object: the line segment cannot fit if it is longer than
the largest diameter; the disk cannot fit if it is wider than the second largest diameter. To
measure the smallest diameter, the isotropic closing would be used.

3 Properties

Properties that are valid for all operators are only specified for the RIA dilation. Properties 
mentioned only for the RIA closing are by duality also true for the RIA opening but not 
for the dilation or erosion. 

Property 1. Translation invariance:

$\delta^+_L f_x = (\delta^+_L f)_x$

Property 2. Compatibility under change of scale:

$S_b \delta^+_L f = \delta^+_{S_b L} S_b f$

The result of the operation is scaled by $b$ if both the image and the structuring element
are scaled by $b$.

Property 3. Local knowledge:

$W_1 \cdot \delta^+_L (W_2 \cdot f) = W_1 \cdot \delta^+_L f$

This property simply states that the result of the operator inside some window $W_1$ is
independent of the image outside some other window $W_2$. This implies that $W_1 \subset W_2$.

These first three properties are the cornerstones of morphology, without which it is
not possible to define shape. Together with the principle of semi-continuity, they are the
requirements for operators to belong to morphology.

Property 4. Rotation invariance:

$\delta^+_L f_\theta = (\delta^+_L f)_\theta$

Rotation invariance of the RIA morphology is a key property, necessary for the
correct analysis of images with an unknown orientation, or images without a single
dominant orientation.

Property 5. Contrast invariance:

$\delta^+_L (c \cdot f) = c \cdot \delta^+_L f$

This property can be taken further, by stating that both the RIA dilation and the
RIA closing commute with any anamorphosis (which is defined as an increasing and
continuous mapping $\mathbb{R} \to \mathbb{R}$).

Property 6. Increasingness:

Property 7. Extensivity / anti-extensivity:

$\varepsilon^+_L f \le f \le \delta^+_L f$

Property 8. Extended extensivity: 

Property 9. Idempotence: 

Property 10. Absorption: 

where $L_i$ is a linear structuring element with length $\ell_i$.

This property states that applying a RIA closing at a large scale to the result of the 
RIA closing at a smaller scale yields the same results as applying it to the original image. 
Furthermore, applying other RIA closings at smaller scales after that has no effect. 

Note that idempotence is a special case of absorption, where $\ell_1 = \ell_2$. Also, the
commutativity of the RIA closing follows from the absorption property, since only the
largest-scale operator influences the result, independently from the order in which they 
are applied. 

Property 11. Sieving:

$\ell_1 < \ell_2$

The sieving property is a requirement for granulometric applications, and is implied
by the increasingness, extensivity and absorption properties. Basically, it states that all
features removed at a smaller scale will also be removed at a larger scale. This allows a
sequence of operators of increasing size to 'sieve' the features in an image and classify
them according to size (see Sect. 4).

Property 12. Commutativity: 






This property follows from Property 10 and does not hold for the RIA dilation and RIA
erosion.

Property 13. Non-distributivity: Unlike the common dilation and erosion, the RIA
dilation and erosion do not distribute with the extremum operators.

Property 14. Comparison with regular morphology:

$f \le \delta^+_L f \le \delta_D f$

$f \ge \varepsilon^+_L f \ge \varepsilon_D f$



4 Granulometry 

Since the RIA closing and opening comply with the sieving property (Property 11), it
is possible to use them as sieving functions in a granulometric application. A sieve is
composed of a sequence of morphological filters with increasing size parameter. The
filters are applied either in series or in parallel (which produces the same result due to
Property 10, absorption), each one removing a group of image features of certain size.
This size is directly proportional to the filter parameter, and the measure that determines
this size depends on the filter construction. Because of the sieving property, each filter
removes all image features also removed by the smaller filters, and never adds new ones.

The difference between the results of subsequent filters is called a granule image [7], 
and contains only image features in a known size range. These granule images can be 
used to construct a size distribution. As said before, the measure used to determine the 
size of an image feature depends on the filter used. A closing with an isotropic structuring 
element (disk) measures the width of dark features. A RIA closing measures the length 
of dark features. Openings do the same with light features. 
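
The sieve and its granule images can be illustrated on a toy signal. The sketch below is not the RIA closing itself: as a stand-in it uses an ordinary flat closing with a fixed-orientation line segment on a 1-D signal, which shares the sieving and absorption behaviour described above. The function names, the test signal and the chosen lengths are all assumptions made for illustration.

```python
import numpy as np

def closing_1d(signal, length):
    """Flat closing (dilation, then erosion) with a segment of odd `length`."""
    half = length // 2

    def moving(values, reducer):
        # Edge padding keeps the output the same size as the input.
        padded = np.pad(values, half, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, length)
        return reducer(windows, axis=1)

    return moving(moving(signal, np.max), np.min)

def granule_images(signal, lengths):
    """Differences between closings at successive scales."""
    closed = [closing_1d(signal, n) for n in lengths]
    return [larger - smaller for smaller, larger in zip(closed[:-1], closed[1:])]

# Bright background with three dark features of width 2, 4 and 6 samples.
signal = np.full(40, 10.0)
signal[3:5] = 0.0
signal[10:14] = 0.0
signal[20:26] = 0.0

lengths = [1, 3, 5, 7]                      # increasing size parameter
granules = granule_images(signal, lengths)
# One number per scale interval: the grey-value volume removed in that interval.
size_distribution = [float(g.sum()) for g in granules]
print(size_distribution)
```

Each granule image contains exactly one of the three dark features, so the summed grey values give a size distribution weighted by feature area.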

The set of granule images forms a scale-space, which makes it possible to measure the 
size of the feature that each pixel belongs to. The ‘trace’ of a pixel through the scales is 
a kind of local size distribution, which gives (for example through a mean or median) a 
scale parameter for that pixel. By going through this process with different filter types, 
we can assign different scale parameters to each pixel; for example the length and width 
of the pore that it belongs to. Knowing these values, it is easy to construct a distribution 
for the elongation. 

5 Conclusions 

We have defined some new morphological operators, based on the premise that, by 
dropping one or more dimensions, an isotropic structuring element in a subspace becomes 
anisotropic in the full image space, but also gains some degrees of rotational freedom. 
This freedom can be used to have the structuring element align itself to the features in 
the image, and thus become rotation invariant. 

We have shown that the dilation with such a structuring element, giving it the 
orientation that causes the result to be maximal, is in fact an isotropic dilation. This comes 
from the fact that the isotropic structuring element is the same as the union of (an infinite 
number of) lower-dimensional isotropic structuring elements with all possible different 
orientations. In contrast, if we give the structuring element the orientation that causes 
the result of the operation to be minimal, we get the dilation operator proposed here. 

In the same manner, we have defined new erosion, closing and opening operators. 
We have stated that all of these operators are rotation, translation, scaling and contrast 
invariant, as well as increasing and extensive. We have also mentioned that the closing 
and opening defined in this article are idempotent, commutative and absorbing. These 
properties are important if we want to use the new operators in the same way we use 
other morphological operators. 

The morphological framework proposed in this article has been defined in two di- 
mensions, but it has been shown that it is easy to extend to higher dimensional spaces. 
In two dimensions, the closing and opening as defined here can be used to measure the 
length of image features. In three dimensions, different versions of the same operator can 
measure both the first and second largest diameters. The smallest diameter is measured 
in all cases using an isotropic structuring element. 

Finally, we have explored an example application for the new operators, that shows 
that they are useful in granulometric applications. 



References 

[1] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, London, 1982. 

[2] C.L. Luengo Hendriks and L.J. van Vliet. A rotation-invariant morphology using anisotropic 
structuring elements. In preparation. 

[3] R. van den Boomgaard. Mathematical Morphology: Extensions Towards Computer Vision. 
Ph.D. thesis, University of Amsterdam, Amsterdam, 1992. 

[4] P. Soille. Morphological Image Analysis. Springer-Verlag, Berlin, 1999. 

[5] G. Matheron. Random Sets and Integral Geometry. John Wiley and Sons, New York, 1975. 

[6] P. Soille. Morphological operators with discrete line segments. In G. Borgefors, I. Nystrom, 
and G. Sanniti di Baja, editors, DGCI 2000, 9th Discrete Geometry for Computer Imagery, 
volume 1953 of LNCS, pages 78-98, Uppsala, Sweden, 2000. 

[7] J.A. Bangham, P. Chardaire, C.J. Pye, and P.D. Ling. Multiscale nonlinear decomposition: 
The sieve decomposition theorem. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 18(5):529-539, 1996. 




Multiscale Feature Extraction from the Visual 
Environment in an Active Vision System 



Youssef Machrouh¹, Jean-Sylvain Lienard¹, and Philippe Tarroux¹,² 

¹ Situated Perception Group, LIMSI-CNRS, BP 133, F-91403 Orsay Cedex, France 
² ENS, 45 rue d’Ulm, F-75230 Paris Cedex 05, France 
{Youssef.Machrouh, Jean-Sylvain.Lienard, Philippe.Tarroux}@limsi.fr 



Abstract. This paper presents a visual architecture able to identify salient re- 
gions in a visual scene and to use them to focus on interesting locations. It is in- 
spired by the ability of natural vision systems to perform a differential process- 
ing of spatial frequencies in both time and space and to focus their attention on a 
local part of the visual scene. The present paper analyzes how this differential 
processing of spatial frequencies is able to provide an artificial system with the 
information required to perform an exploration of its visual world based on a 
center-surround distinction of the external scene. It shows how the salient loca- 
tions can be gathered on the basis of their similarities to form a high level repre- 
sentation of the visual scene. 



1 Introduction 

The use of active mechanisms seems to be a way to improve the abilities of machine 
vision systems. Active systems search salient features in the visual scene through a 
dynamic exploration. They can direct their search toward the most meaningful stimuli 
using attentional mechanisms leading to a reduction of the computational load [1,2]. 
Thus, natural vision is a behavioral task, not a passive filtering process. An exploration 
of the visual world that relates perception and action allows for a labeling of the exter- 
nal space with natural landmarks associated with the exploratory behavior. In this 
respect, the relationships between agents and natural systems suggest that certain as- 
pects of natural perception can be successfully incorporated in artificial agents. 

In parallel, during the past few years, several studies have been devoted to 
understanding the essence of vision considered as an information processing 
mechanism [3]. This approach is grounded in Barlow’s proposal [4], which states that the 
main organizational principle in visual systems is the reduction of the redundancy of 
the incoming stimuli. 

These considerations, derived from information theory, led several authors to 
analyze the statistical organization of natural images. They demonstrated that natural 
images (those which do not exhibit any specific bias in their pixel distribution) have 
stationary statistics and a self-similar structure. As a consequence of these 
characteristics, their power spectra fall off as 1/f^2 [5]. 

In this context, different authors [6, 7] demonstrated that a way to transform the 
initial redundancy was to improve the statistical independence of the image descriptors. 
According to this hypothesis, an image can be viewed as a linear superposition of 
several underlying independent sources. 

The filters that provide this statistical independence can be computed by applying 
appropriate source separation algorithms (Infomax, BSS, ICA). 

One can show [6, 7] that the optimal filters computed according to these principles 
are multiscale local orientation detectors similar to a Gabor wavelet basis [8]. 

However, although a lot of work has been devoted to the understanding of these 
theoretical bases of information processing in natural visual systems, few attempts have 
been made thus far to use these principles in artificial vision systems. Practical 
implementations impose some limitations that require analyzing what is really obtained 
with simplified models based on these general principles. On the other hand, no artifi- 
cial vision system has been designed to include both multiscale wavelet analysis and 
differential spatial and temporal processing of spatial frequencies. A prerequisite to 
the design of such a system is to be able to characterize the information obtained from 
a bank of wavelet filters in different frequency channels. 

We thus analyzed here the information issued from various combinations of high 
and low frequencies of statistically uncorrelated signals. Our aim was to determine 
how to build a multivariate representation of the scene that allows a dynamic grouping 
of image points on the basis of their similarities in a given context and for a given 
task. 



2 System Overview 

2.1 Image Data 

A set of 11 natural images selected from a larger database was used in the present 
study. Pictures that include too many traces of human activity (buildings, roads...) 
were avoided. Only images with similar initial resolution (around 256x512 pixels) 
were retained. 




Fig. 1. Sample image from the set of natural images used in the present work. (Original size 
512x256) 

Images were discarded when their power spectrum did not fit the 1/f^2 
characteristic [5]. Figure 1 shows one typical example of an image used in the present study. 
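
One way such a selection criterion could be checked is to fit the slope of the radially averaged power spectrum on log-log axes: a 1/f^2 power spectrum gives a slope close to -2. The following is a minimal sketch under that assumption. The function names, the number of logarithmic bins and the synthetic test image are illustrative choices, not the procedure actually used in this study.

```python
import numpy as np

def spectral_slope(image, n_bins=10):
    """Slope of log(power) against log(frequency), radially averaged."""
    image = image - image.mean()
    power = np.abs(np.fft.fft2(image)) ** 2
    h, w = image.shape
    radius = np.hypot(np.fft.fftfreq(h)[:, None], np.fft.fftfreq(w)[None, :])
    # Logarithmic bins from a few cycles per image up to the Nyquist limit.
    edges = np.geomspace(4.0 / max(h, w), 0.5, n_bins + 1)
    which = np.digitize(radius.ravel(), edges)
    centers, means = [], []
    for b in range(1, len(edges)):
        selected = which == b
        if selected.any():
            centers.append(np.sqrt(edges[b - 1] * edges[b]))
            means.append(power.ravel()[selected].mean())
    slope, _ = np.polyfit(np.log(centers), np.log(means), 1)
    return slope

def synthetic_one_over_f(size=256, seed=0):
    """White noise filtered to have a 1/f amplitude (1/f^2 power) spectrum."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    f = np.hypot(np.fft.fftfreq(size)[:, None], np.fft.fftfreq(size)[None, :])
    f[0, 0] = 1.0  # avoid dividing by zero at the DC component
    return np.fft.ifft2(np.fft.fft2(noise) / f).real

slope = spectral_slope(synthetic_one_over_f())
print(f"fitted slope: {slope:.2f}")
```

An image would then be kept when the fitted slope lies within some tolerance of -2; the tolerance is a free parameter that the text does not specify.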
2.2 Initial Filters 

A guideline for this work was to retain among the filtering characteristics of the pri- 
mate visual system those that can be useful for the elaboration of an artificial system 
of situated and active vision. 

Two characteristics have attracted our attention: the elimination of image redun- 
dancy in the processing steps designed to maximize the statistical independence of the 
scene descriptors and the differences in the processing of spatial frequencies between 
the center and the surround of the visual field. 

The visual scene was filtered by a first bank of Gabor wavelets in four spatial ori- 
entations and four spatial frequencies (1/8, 1/16, 1/32, 1/64). For each initial image, 
we got 32 resulting images (two for each quadrature pair of each of the 16 Gabor 
filters). This multiscale processing was implemented using a Burt pyramid according 
to the method proposed by Guerin-Dugue [9] . 
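
The structure of such a filter bank (4 orientations, 4 spatial frequencies, one even and one odd filter per combination, hence 32 kernels) can be sketched as follows. This is a direct construction of the kernels and does not reproduce the pyramid implementation cited above. The envelope width (half a cycle per standard deviation) and the function names are assumptions made for illustration.

```python
import numpy as np

def gabor_pair(frequency, theta, sigma):
    """Even (cosine) and odd (sine) Gabor kernels forming a quadrature pair."""
    half = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)     # coordinate along the wave vector
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    even = envelope * np.cos(2 * np.pi * frequency * u)
    odd = envelope * np.sin(2 * np.pi * frequency * u)
    even = even - even.mean()                     # remove the residual DC response
    return even, odd

def gabor_bank(frequencies=(1 / 8, 1 / 16, 1 / 32, 1 / 64),
               n_orientations=4, cycles_per_sigma=0.5):
    """Dictionary mapping (frequency, orientation index) to an (even, odd) pair."""
    bank = {}
    for f in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            bank[(f, k)] = gabor_pair(f, theta, sigma=cycles_per_sigma / f)
    return bank

bank = gabor_bank()
n_kernels = sum(len(pair) for pair in bank.values())
print(len(bank), "filters,", n_kernels, "kernels")
```

Convolving an image with every kernel yields the 32 response images mentioned in the text. The lowest frequency requires large kernels, which is the cost a pyramid implementation avoids.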

For the purpose of this study and in order to obtain a complete view of what infor- 
mation is obtained from a detector during a systematic exploration of the visual scene, 
the whole scene was filtered by the entire bank of filters. In an operational system with 
focal vision, only a small part of these computations is needed. 



2.3 Simple Cells - Complex Cells 

An important distinction between the use of wavelets in image processing and the 
filtering steps in the visual system is the presence of strong non-linearities in the latter. 
Primary visual cortex shows several cell types according to the non-linearities they 
implement. Simple cells (SC) perform an additive combination of their inputs. They 
respond to an oriented stimulus localized at the center of their receptive field. The so- 
called complex cells (CC), on the contrary, exhibit a kind of translational invariance 
and respond to a stimulus whatever its position in the receptive field of the cell. 

Fig. 2. Effects of filtering on the statistical independence criterion. Init: Initial image, SC: 
Simple cells, CC: Complex cells 

Other cell types (mainly in extrastriate cortex) combine these outputs in order to be 
sensitive to curvature and terminations (end-stop cells). 

To model simple cells we used additive units with a zero-threshold ramp transfer 
function, which amounts to taking into account only the positive part of the Gabor filter 
responses. Indeed, these cells do not transmit the inhibitory part. 

Following Field [5], we modeled the complex cell output as the norm of quadrature-pair 
Gabor filters. We verified that this implementation effectively leads to a reduction 
of the redundancy for both cell types by a comparison of the kurtosis before and after 
filtering (Figure 2). Kurtosis is indeed a good measurement of the statistical 
independence of a set of detectors [10]. 
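
The two non-linearities and the kurtosis criterion can be summarized in a few lines. This is a minimal sketch: the function names are illustrative, and kurtosis is taken here as the plain fourth standardized moment (3 for a Gaussian), which is one common convention and may not be the exact estimator used in the study.

```python
import numpy as np

def simple_cell(response):
    """Zero-threshold ramp: keep only the positive part of a Gabor response."""
    return np.maximum(response, 0.0)

def complex_cell(even_response, odd_response):
    """Norm of the responses of a quadrature pair of Gabor filters."""
    return np.hypot(even_response, odd_response)

def kurtosis(values):
    """Fourth standardized moment; larger values indicate a sparser response."""
    v = np.asarray(values, dtype=float).ravel()
    centered = v - v.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2

# A grating seen through a quadrature pair: the complex cell output does not
# depend on the phase (position) of the stimulus, unlike the simple cell output.
phase = np.linspace(0.0, 2.0 * np.pi, 50)
sc = simple_cell(np.cos(phase))
cc = complex_cell(np.cos(phase), np.sin(phase))
print(sc.min(), sc.max(), cc.min(), cc.max())
```

Comparing `kurtosis(image)` with the kurtosis of the SC and CC outputs is the kind of before/after comparison reported in Figure 2.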

A third type of detector, with large receptive fields and designed to provide 
contextual information, will be considered in the following section. 

In order to build a set of higher-level detectors suitable for the extraction of 
complex features, we performed a Karhunen-Loeve transform of the outputs. A set of 1744 
image patches (5x5) extracted randomly from the initial 11 natural images was used to 
build these spaces. We thus obtained 8 eigen-vectors at the output of simple cells and 
4 eigen-vectors at the output of complex cells for each frequency band. These 
computations amount to a non-linear principal component projection of the initial image 
performed with two different types of non-linearities. 
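
The construction of such an output space can be sketched as a principal component analysis of randomly sampled patches. The sketch assumes, for simplicity, that each patch is a 5x5 window taken from a single 2-D response map, so that each sample has 25 components. How the orientation channels are combined into one sample is not specified here, and the function names are illustrative.

```python
import numpy as np

def sample_patches(maps, n_patches=1744, size=5, seed=0):
    """Draw flattened size x size patches at random from a list of 2-D maps."""
    rng = np.random.default_rng(seed)
    patches = np.empty((n_patches, size * size))
    for i in range(n_patches):
        m = maps[rng.integers(len(maps))]
        r = rng.integers(m.shape[0] - size + 1)
        c = rng.integers(m.shape[1] - size + 1)
        patches[i] = m[r:r + size, c:c + size].ravel()
    return patches

def kl_basis(patches, n_components):
    """Leading eigenvalues and eigenvectors of the patch covariance matrix."""
    centered = patches - patches.mean(axis=0)
    covariance = centered.T @ centered / (len(patches) - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return eigenvalues[order], eigenvectors[:, order]

# Stand-in data: random maps in place of real SC or CC response images.
rng = np.random.default_rng(1)
maps = [rng.standard_normal((32, 48)) for _ in range(3)]
patches = sample_patches(maps)
values, axes = kl_basis(patches, n_components=8)
print(axes.shape)
```

Projecting every patch of a response image on one column of `axes` produces one eigen-image of the kind shown in Figures 3 and 4.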



2.4 Global Energy - Local Context 

As stated above, we assumed the existence of detectors sensitive to the global energy 
in the different orientations. In an image region corresponding to the fovea, the system 
computes a global energy vector for each of the four orientations. This vector is used 
to build a signature that can be used to classify the region. Such an analysis provides 
us with contextual information [11, 12]. We consider the identification of these con- 
texts as a prerequisite for the recognition of objects. The importance of contextual 
information in natural systems can be deduced from the experimental observation that 
object recognition is effectively facilitated if the objects are viewed in congruent con- 
texts [12]. 

Thus, the system computes three output sets on each image: (i) an output directly 
issued from the Gabor filters filtered by a ramp function (SC), (ii) an output giving the 
local energy at the output of these filters analogous to the output of complex cells 
(CC) and (iii) a large field output providing contextual information. 



3 Results 

3.1 Simple Cells 



For each image point the system provides a high dimensional vector made of 32 ori- 
entation components spread over 4 frequency bands for SC detectors and 16 orienta- 
tion components in 4 frequency bands for CC detectors. 

Although Gabor detectors maximize the statistical independence of their outputs, in 
practice they are not strictly independent. The analysis of these outputs through a 
Karhunen-Loeve transform leads to a data representation basis that sorts the represen- 
tations according to their greatest statistical significance. 

The first axis corresponding to the highest eigen-value shows highly variable details 
from one scene to another (figure 3 left). It emphasizes details related to the structures 
present in the scene. This probably results from the fact that these structures are cor- 
related in a given scene due to the correlation induced by the presence of objects. They 
are uncorrelated from one scene to another because each scene has a different 
organization. 




Fig. 3. Output of SC filters: projection of the output along the first (left) and the last (right) 
eigen-vector of the output space 



On the contrary, details filtered by the axes corresponding to the lowest eigen- 
values (figure 3 right) are expected to weakly contribute to the total variance. They 
correspond to features most frequently observed from one image to another. 




Fig. 4. Eigen-images from CC filters. The images are computed as the projection of the CC 
outputs on the eigen-vectors defining the output space of these filters. Columns range from high 
to low frequencies (from left to right: 1/8 to 1/64). Lines show the filter outputs along the prin- 
cipal components (top: highest variance, bottom: lowest variance). 



The same region revealed by the first projection axis (Figure 3 left; 29.4% of the initial 
variance) of the KL transform and by the last projection axis (Figure 3 right; 2.47% of the 
initial variance) shows that, while the first axis tends to reveal long edges that 
contribute significantly to the general structure of the objects, the last axis tends to reveal 
termination and curvature points that are not characteristic of the image structure. 

We obtain a complex set of features along the different axes. The most representa- 
tive of the presence of objects correspond to the first axes. On the others, features 
representing complex combinations of stimuli frequently observed in natural images 
seem to be sorted according to their level of abstractness. 



3.2 Complex Cells 

The same transform can be applied to the output of complex cells. Figure 4 shows the 
main axes of the KL transform following the computation of the Gabor norm for dif- 
ferent spatial frequency bands. 

The projection axes (rows in the figure) extract distinct features from the initial 
image, both within the same frequency band (rows) and between different frequency 
bands (columns). Note, for instance, that the building vanishes in the projection on 
axis 3 (Figure 4, third row). These features are entirely different from those extracted 
by the output transform of SC. 

One can observe that high frequency details disappear in low frequency channels 
except for objects that exhibit frequency similarities (high frequency details repeated 
over a large area like the building). 

Objects in the foreground, which are apparently characterized by low frequencies, 
appear in low frequency channels while they are not represented in high frequency 
bands. Low frequency channels are able to distinguish features that have some spatial 
extension (the building or the foreground bushes). 

A comparison of the lowest frequency channels (Figure 4 right column) shows that 
the locations revealed on the different axes are largely uncorrelated, thus correspond- 
ing to different points of view on the scene. 

The smaller number of low frequency features (Figure 4 right column) defines a small 
set of landmarks able to characterize the visual space and to guide exploratory sac- 
cades. This low-frequency information is the only one available in the periphery of the 
visual field. 



3.3 Correlation between Channels 

One of the important questions raised by this analysis is how different are the indices 
obtained from the different frequency channels. If two channels correspond to the 
same combination of basic features, the corresponding eigen-vectors should be simi- 
lar. Thus, a measure of the similarity between the eigen-vectors in different frequency 
bands is given by the product of the eigen-matrices in these frequency bands. Using 
this method we compared the output spaces of respectively simple and complex cells 
for different frequency bands. We obtained strongly different results for the compari- 
son of output spaces in SC channels and in CC channels. 
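
The comparison by product of eigen-matrices can be sketched as follows, assuming that each eigen-matrix stores unit-norm eigen-vectors as columns. Taking the absolute value is an added assumption, motivated by the fact that the sign of an eigen-vector is arbitrary. The function name and the random stand-in basis are illustrative.

```python
import numpy as np

def axis_similarity(basis_a, basis_b):
    """Entry (i, j) is the absolute cosine between axis i of a and axis j of b."""
    return np.abs(basis_a.T @ basis_b)

# Stand-in orthonormal basis (4 axes in a 25-dimensional patch space).
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((25, 4)))

same = axis_similarity(basis, basis)       # identical spaces: identity matrix
flipped = axis_similarity(basis, -basis)   # sign flips do not change the result
print(np.round(same, 3))
```

With this convention, the diagonal of the product gives the correlation between axes of the same rank in two frequency bands, and the off-diagonal entries give the cross-correlations between axes of different rank.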

For simple cells, the correlations between the axes of the spaces corresponding to 
different frequencies are low and distributed over the different axes (data not shown), 
while for complex cells the two high frequency bands, and likewise the two low 
frequency bands, exhibit strong similarities (Table 1). 

Table 1. Analysis of the output space for CC detectors. The eigen-vectors corresponding to the 
same axes show a very high correlation between respectively high and low frequency channels. 
The cross-correlation between eigen vectors corresponding to different axes is usually low (not 
reprinted here) 



        Frequencies 
Axes    f0/f1   f0/f2   f0/f3   f1/f2   f1/f3   f2/f3 
F1      0.990   0.442   0.410   0.365   0.330   0.996 
F2      0.997   0.517   0.501   0.507   0.486   0.997 
F3      0.991   0.363   0.370   0.425   0.424   0.995 
F4      0.994   0.656   0.653   0.641   0.630   0.996 



These results lead to the conclusion that the combinations of simple cell outputs 
across the frequency bands underline uncorrelated details, whereas the complex cell 
outputs within the high (resp. low) frequency bands correspond most frequently to similar stimuli. 

A pyramidal decomposition of the scene allows combining these characteristics to 
identify spatial positions characterized by spectral compositions as diverse as possible. 

This diversity seems to lead to a greater separability of these spatial positions and 
seems to be able to facilitate object discrimination. 



3.4 Identification of Global Contexts 

Cells sensitive to low frequencies have large receptive fields. However, in higher 
layers of the visual system cell types that encode intermediate representations also 
exhibit larger receptive fields. They combine the output of the cells in the preceding 
layers and gather the information coming from brighter regions of the visual field. 

A vector that combines the global energy components associated with each fre- 
quency channel provides a suitable code for representing the whole fovea. It has been 
shown that such vectors can be used to classify visual scenes according to the context 
they belong to [11, 12]. In the present study, we build such detectors by computing the 
mean energy provided by the output of CC cells in the four frequency bands already 
mentioned. 

To determine how spatial indices provided by the channels previously described 
can be used for the identification of interesting locations in the scene, we performed 
the following experiment: 

A set of salient locations is computed from the eigen-images defined previously. 
Points in the image are selected at random or on the basis of these salient locations. At 
each point the mean energies of the CC outputs in an image window corresponding to 
the fovea were computed for each frequency. We thus obtained an energy vector for 
each of the selected points. A PCA was performed on this set of vectors. One 
should keep in mind that this use of PCA differs from its use in the previous sections. 
The Karhunen-Loeve transform was previously used as a self-organization tool 
leading to a set of linear combinations defining complex features frequently occurring in 
natural images. In this section, PCA should be considered as a means to analyze the 
structure of the space at the output of the SC and CC filters. 
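
The experiment described above can be sketched in two steps: compute one mean-energy vector per selected point, then project the set of vectors on its principal components. The window half-size, the array layout (channels first) and the function names are assumptions made for illustration, and the toy maps below stand in for real CC outputs.

```python
import numpy as np

def energy_vectors(cc_maps, points, half_window=8):
    """Mean CC energy per channel in a square window around each point.

    cc_maps has shape (n_channels, height, width); points holds (row, col) pairs.
    """
    n_channels, height, width = cc_maps.shape
    vectors = np.empty((len(points), n_channels))
    for i, (r, c) in enumerate(points):
        r0, r1 = max(r - half_window, 0), min(r + half_window + 1, height)
        c0, c1 = max(c - half_window, 0), min(c + half_window + 1, width)
        vectors[i] = cc_maps[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return vectors

def pca_project(vectors, n_components=3):
    """Coordinates of the centered vectors on the leading principal axes."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy scene: channel 0 is strong on the right half only, channel 1 is uniform.
cc_maps = np.zeros((4, 64, 64))
cc_maps[0, :, 32:] = 5.0
cc_maps[1] = 1.0
points = [(10, 10), (20, 12), (10, 50), (20, 52)]

vectors = energy_vectors(cc_maps, points)
projected = pca_project(vectors)
print(np.round(projected[:, 0], 2))
```

Points whose windows see the same spectral content receive the same coordinates, so a clustering step applied to `projected` would group them, which is the behaviour reported for the salient points.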




Fig. 5. Clustering of fixation points corresponding to different regions of the visual scene. 
Clusters were identified on the first three principal components, and the fixation points 
corresponding to each cluster are plotted on the diagrams at their position in the initial image. 
(a), (b) Fixation points obtained from the second eigen-image and the second frequency channel 
shown in Fig. 4. The other diagrams show the location of some clusters gathering salient points 
on the basis of their spatial frequencies and orientation properties: (c) trees and bushes, 
(d) building, (e) strong curvature at the border between hill and sky, (f) another region of 
interest at the same border 



When the locations in the image were selected at random, no obvious structure was 
observed in the PCA space. On the contrary, when they were selected on the basis of 
their saliencies, clusters were identified in the PCA space. Figure 5 shows the 
locations of some of these clusters on the original image. Points corresponding to a similar 
context are grouped into the same cluster. The example shows, for instance, the ability 
of the method to separate fixation points on the basis of their natural or artificial 
nature (Figure 5 c and d). 

It should be noted that Figure 5 shows only a small sample of the structures that can 
be identified. Only 1/16 of the available dimensions is presented here. Thus, the 
method transforms the initial image into a huge set of clusters each characterized by 
similar spectral signatures. 

4 Discussion and Conclusion 

The visual filter system proposed in the present work produces a set of features that 
can be used to guide the exploration of the external scene. The features extracted by 
the non linear combination of SC channels seem rather suitable for object recognition. 
Features obtained from the computation of local energy (CC channels) allow a parti- 
tion of the image into salient regions arranged according to their frequency composi- 
tion. The computation of the global energy provides local context information and can 
be used to segment the scene on the basis of its spectral characteristics. 

Thus, the output of this filtering system provides, on the one hand, locations of interest 
able to guide an attentional system and, on the other hand, clusters of locations arranged 
according to their spectral signature. 

This approach can be considered as an extension of texture segmentation methods 
[13] to the question of the identification of contexts and an extension of the method 
proposed by Herault [11] to the analysis of local contexts. However, it emphasizes the 
relativity of the context notion; the segmentation of the visual scene into (i) a global 
context and (ii) objects is an oversimplification. 

The visual scene is thus scattered into a set of projections on several disjoint sub- 
spaces. In each of these subspaces, salient points form clusters according to their 
similarities. These salient points are projected into disjoint sets of clusters and the 
corresponding objects can thus be grouped according to different points of view. 

An object class is not characterized by a unique high-level representation, but by 
the transient association of a subset of properties. This association can thus dynami- 
cally depend on the current task. Objects are not considered as similar and grouped on 
the basis of their intrinsic properties but according to those of their properties linked to 
a given goal. 

A further step in this work will be to demonstrate how such coding abilities could 
indeed facilitate object classification. This requires incorporating the present algo- 
rithms in the control architecture of a perceptive agent such that it can build a hierar- 
chy of perception-action links based on the dynamic grouping of the perceived fea- 
tures. 

Abstract. We propose a method to generate component-based shape 
descriptions by the application of a perceptual grouping approach known 
as tensor voting. Based on previously described results on the generation 
of region, curve and junction saliencies and motivated by psychological 
findings about shape perception, we introduce extensions by a voting 
between junctions to create amodal completions, by a labeling of the 
junctions according to a catalog of junction types, and by a traversal 
algorithm to collect the local information into globally consistent part 
decompositions. In contrast to commonly used partitioning schemes, our 
method is able to create layered representations of overlapping parts. We 
consider this a major advantage together with the use of local operations 
and low computational costs whereas other approaches are based on 
highly iterative processes. 



1 Introduction 

Several classic publications - e. g. HEZHI, EiHTI, EEHa, and - emphasize 
that the search for appropriate shape descriptions could be regarded as 
the key problem of computer vision. As shown by |H1N88| and |H om98| . a repre- 
sentation scheme with expressiveness, discriminating power, stability, invariance 
against viewing conditions and occlusion necessarily has to be a component- 
based description. However, existing algorithms for segmentation and the gen- 
eration of shape descriptions have several drawbacks which will be discussed in 
the following review of fundamental approaches. 

Many methods can be traced back to a symmetric axis transform (SAT) 
introduced by fBlii67j and |HN78j leading towards a skeletal form representation. 
Main disadvantages are, besides the difficulty of computation, sensitivity to noise 
and problems with the handling of overlapping forms (Fig. 1a-c). Furthermore 
the aspect of symmetry is overemphasized leading to disagreements with recent 
psychophysical results on shape perception which see the relation to the Gestalt 
law of good form as follows: “. . . simplicity and regularity of form are outcomes, 
not causes, of the unit formation process” IKHnU. 

Other techniques are based on the transversality principle described by 
[lHH,84j which postulates part decompositions at points of maximal negative 

Fig. 1. Skeletal form representations have problems with a) the presence of gaps, 
b) overlapping parts, and c) noise (after csmi). Curve evolution approaches can 
yield intuitively implausible parts - body and tail in (d) - as opposed to the desired 
segmentation (e), after 



curvature along a contour. Even though computation of extremal curvature is 
considered to be based on ecological vision theory, negative curvature can only 
give preliminary indications where to split segments without being a sufficient 
condition. Hence, it commonly yields too many decompositions due to the fact 
that all possible locations for segmentations along the contour are examined in- 
dependently from each other without the consideration of combinations between 
potentially cooperating positions. 

According to Em and Em the duality of extremal curvature and sym- 
metric axes can be expressed by a so-called process-grammar describing a form 
as a history of deformations from an initial shape. This idea of evolving con- 
tours has been further explored by IKTXOni with analogies to physical reaction- 
diffusion models and by k!kM(17l . An improved curve evolution scheme which 
prevents the diffusion from blurring corners and dislocating feature positions 
can be found in [LL99]. While curve evolution models provide an elegant 
formalism and good results for additive parts, they still tend to yield undesired 
segmentations (Fig. 1d-e) and have deficiencies in handling overlapping parts 
and negative parts (i.e. bridging indentation gaps) because interactions between 
related segmentation positions along the contour are neglected. Additionally, the 
input usually is required to be a closed contour. Furthermore, the edge 
polarity is not taken into consideration despite its importance observed in human 
shape perception (e. g. a Kanizsa-Triangle with outlined circles does not yield 
the perception of virtual contours due to the absence of polarity information). 
Above all, the inherently iterative process seems to be inconsistent with the 
phenomenon of spontaneous shape perception known since [Kof35] and [Pet56]. The 
function of the iterations is rather a contribution to different abstraction levels 
of a multi-scale representation. 

Serious drawbacks of all mentioned segmentation schemes arise from the 
strategy to decompose the input image into disjoint partitions, which prevents 
them from an adequate handling of occlusions. For that reason, we intend to 
find a method which facilitates the generation of overlapping forms. 

Several aspects discussed here can be solved by employing methods of 
perceptual grouping. Firstly, there is strong support from the theory of visual 
interpolation established by Kellman and Shipley on the basis of an 
extensive collection of psychological findings about shape perception (see also 
EESl). Using the relatability of discontinuities, they achieve a unified explana- 
tion for form completion in the case of partly occluded objects, virtual contour 
illusions and transparent occlusions. Accordingly, these phenomena, where con- 
nections of edge elements occur even across areas without physically existing 
edges, are based on a common grouping process for unit formation. Secondly, 
there is an adequate means for the realization of perceptual grouping provided by 
the tensor voting technique which has been developed by |M brOO) and jGMhbj 
and belongs together with [SUhtij . [SHHdj . and frWhtij to the field of perceptual 
saliency theories. Tensor voting has the following advantages over similar
methods and over techniques from other fields, e.g. regularization, consistent
labeling as in !ba,u99|, clustering, robust methods, and connectionist models:
the method is not iterative, does not require manual setting of starting
parameters, handles the presence of multiple curves, regions, and surfaces at
the same time, is highly stable against noise, and still preserves
discontinuities. The only parameter is scale, which is in agreement with the
scale dependency of human shape perception. The universality of tensor voting
has been demonstrated in [MLT00] by application to a range of early vision
problems. In contrast to commonly applied vector fields, this method profits
from the definition of appropriate tensor fields which represent the
information propagated from sparse input locations into their neighborhoods.
All these fields are combined by means of a tensor addition (in contrast to a
vector addition), which simultaneously yields the saliency for junctions,
curves, and surfaces. The result of the perceptual grouping is derived by
extracting the maxima in the saliency maps; for curves and surfaces, a marching
algorithm is employed to trace these structures along their highest
saliencies.
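
The difference between tensor addition and vector addition can be illustrated
with a minimal 2-D sketch. It is not the implementation used in the cited work:
it assumes that each vote is an orientation with a weight, encoded as the
symmetric tensor w d d^T, and it reads the saliencies from the eigenvalues of
the summed tensor (stick saliency l1 - l2, ball saliency l2), which is the
customary split in tensor voting. All function names are illustrative.

```python
import math

def stick_tensor(direction, weight=1.0):
    """Encode an undirected orientation as the 2x2 tensor weight * d d^T.

    The tensor is stored as the triple (txx, txy, tyy).
    """
    dx, dy = direction
    norm = math.hypot(dx, dy)
    dx, dy = dx / norm, dy / norm
    return (weight * dx * dx, weight * dx * dy, weight * dy * dy)

def add_tensors(tensors):
    """Tensor addition: component-wise sum of symmetric 2x2 tensors."""
    return (sum(t[0] for t in tensors),
            sum(t[1] for t in tensors),
            sum(t[2] for t in tensors))

def decompose(tensor):
    """Return (stick_saliency, ball_saliency, orientation_in_radians)."""
    txx, txy, tyy = tensor
    mean = 0.5 * (txx + tyy)
    radius = math.hypot(0.5 * (txx - tyy), txy)
    lam1, lam2 = mean + radius, mean - radius  # eigenvalues, lam1 >= lam2
    theta = 0.5 * math.atan2(2.0 * txy, txx - tyy)
    return lam1 - lam2, lam2, theta

# Opposite votes along the same line reinforce each other in the tensor sum.
collinear = decompose(add_tensors([stick_tensor((1, 0)), stick_tensor((-1, 0))]))
# Perpendicular votes leave no preferred orientation (junction-like response).
crossing = decompose(add_tensors([stick_tensor((1, 0)), stick_tensor((0, 1))]))
```

Two collinear votes of opposite sign would cancel in a vector sum, whereas here
they yield a pure stick component; two perpendicular votes yield a pure ball
component.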

2 Shape Descriptions from Perceptual Grouping 

2.1 Overview 

We will present a method to derive shape descriptions by decomposing forms
into multiple, possibly overlapping layers. Such a representation is supposed to
be a natural solution for the handling of occlusions. It is not only capable of
extracting the overlapping parts but even allows their depth placement to be
estimated. The application to complex objects and object recognition tasks will
clearly profit from the higher level of the representation compared to the
matching of uncombined features. However, this has to be regarded as a long-term
aim. Here, we will discuss the foundations necessary for achieving a layered
shape representation. Therefore, we have chosen input images consisting of
binary polygonal shapes as the most basic case. It is important to note that
this reduction does not restrict the proposed method to this domain but rather
shows its basic properties; see Section 3 for possible extensions.

As mentioned before, this kind of silhouette image has been studied in
psychological experiments under the term spontaneously splitting figures.
According to different studies reviewed in IRWI, the segmentation is initiated
by the discontinuities, i.e. junctions, along the contour where interpolated
edges start to connect so-called relatable edges to form part groupings. The
relatability of two edges depends on the alignment of their orientations and
expresses to what extent they could be connected in the sense of a good
continuation. This aspect, together with the consideration of spatial
neighborhood, is implicitly encoded in the layout of the tensor voting fields
applied to each input token. Thus, we have an efficient means to implement
processes which have mainly been described qualitatively.

The outline of our implementation is as follows (Fig. 2): First, we have to
find the contour in the given input image in order to compute the junctions
contained in this form. Both tasks can be solved by tensor voting, too, as shown
in [MLT00]. The next step is a newly introduced voting between the junctions:
each junction sends out voting fields in the opposite directions of its two
incoming edges, hence looking for continuations of the “abruptly” ending line
segments. The overlapping fields of interacting junctions will create high curve
saliencies between these junctions. By extracting the curves along the locally
maximal saliencies, we get candidates for the amodal completions of part
boundaries, which we will briefly call virtual contours. In addition to the
interactions of junctions, there can be interactions between a junction and a
nearby contour, creating the important class of T-junctions. A T-junction will in
turn vote backwards to the junction by which it has been created, thus giving
rise to a virtual contour between them. Finally, we have to collect the
information about real and virtual contours to connect them to closed part
boundaries. This step is non-trivial because it requires the integration of
local operations with global constraints. We will define a process which uses
some kind of cursor, called walker, to find local interpretations based on the
conservation of direction and region polarity at the different junction types to
get a globally consistent part boundary. The method facilitates the generation
of overlapping layers, for which we will discuss strategies to achieve
meaningful depth orderings in Section 3.

Fig. 2. Flowchart of the part decomposition algorithm. The right half shows new
stages introduced in this paper. The letters A-F refer to outputs for a running
example depicted in Figs. 3, 4 and 8.

2.2 Computation of Real Contours and Junctions

Given an input image as shown in Fig. 3A, we first have to find the boundaries
of the contained regions. For that purpose we use the region inference method
already introduced in [MLT00]. In summary, a radial isotropic voting field is
applied to each input location and information from different sites is collected
by a first-order moment computation which aggregates the incoming votes as a
vector sum. Points along a region boundary will mostly receive votes from one
side, in contrast to isolated points and points inside the region, for which the
incoming votes are equally distributed over all directions. Hence, the norm of
the vector sum can be regarded as a measure of the boundary saliency (Fig. 3B),
while its direction represents the direction of the region polarity. Having found
the boundary points, we can get the curves and junctions by another voting step
which applies a so-called stick field perpendicularly to each polarity vector and
uses tensor additions for the collection of the votes. The tensors of the
resulting map are decomposed into undirected components, called ball tensors, and
directional components, called stick tensors. The norms of these components
represent a measure of the junction saliency and the curve saliency,
respectively. By extracting the locally maximal saliencies, we simultaneously
obtain the junctions and the contour of the input form (Fig. 3C).
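
The first-order voting step described above can be sketched as follows. This is
an illustration under assumptions, not the region inference method of the cited
work: the Gaussian decay, the parameter sigma and the function name polarity are
hypothetical choices.

```python
import math

def polarity(points, index, sigma=3.0):
    """Vector sum of radially decaying votes received by points[index].

    Returns (saliency, (ux, uy)): the norm of the vector sum and its unit
    direction, which points toward the side from which most votes arrive.
    """
    px, py = points[index]
    sx = sy = 0.0
    for j, (qx, qy) in enumerate(points):
        dx, dy = qx - px, qy - py
        dist = math.hypot(dx, dy)
        if j == index or dist == 0.0:
            continue  # a point does not vote for itself
        weight = math.exp(-(dist * dist) / (sigma * sigma))  # assumed decay
        sx += weight * dx / dist
        sy += weight * dy / dist
    norm = math.hypot(sx, sy)
    direction = (sx / norm, sy / norm) if norm > 1e-12 else (0.0, 0.0)
    return norm, direction

# A filled 7 x 7 square: an interior point and a point on the left boundary.
square = [(x, y) for x in range(7) for y in range(7)]
inner_saliency, _ = polarity(square, square.index((3, 3)))
edge_saliency, edge_direction = polarity(square, square.index((0, 3)))
```

For the interior point the votes cancel, while the boundary point obtains a high
saliency with a polarity vector pointing into the region.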




Fig. 3. Computation of real contours and junctions: A) input image, B) region saliency
map, C) extracted contours and junctions. The graphs illustrate the tensor voting field
applied at each stage. See Fig. 4 for the result of the rightmost field.



2.3 Computation of Virtual Contours 

In order to create interactions between the detected junctions, we first have
to determine for each junction the directions of the incoming real contours.
Naturally, the curve saliency in the vicinity of a junction is very unreliable due
to the simultaneous influence of two different lines. Therefore, we discard this
weak curve information within a predefined radius around a junction and look
for the edges crossing a circle with a similar radius. In addition, we store for
each incoming edge its region polarity, which has been computed previously. As the
input consists of binary forms, the junctions found so far always consist of exactly
two crossing edges of the real contour. Hence, we call them L0-junctions, where
the index denotes the number of incoming virtual contours.

The first step of junction voting is necessary to detect T-junctions on the real 
contour. For that purpose, at each L-junction semi-lobes of a stick voting field 
are applied in opposite directions of the two incoming edges. Note that the stick 
fields used for both steps of junction voting have been adapted to cover a smaller 
opening angle in order to give higher preference to a straight continuation and 
to reduce position uncertainty which is not needed to the same extent as for 
finding structures in general input domains. 

The interaction of these votes with the real contour transformed to its tensor
representation yields high junction saliencies at locations of T-junctions, which
are then extracted by non-maxima suppression. The stem of each T-junction
is computed to point in the direction of the L0-junction by which it has been
created, and the region polarity of this L0-junction is used as an estimate of
the region polarity at the T-stem.

Then, the second step creates all possible virtual contours as high curve
saliencies by applying junction voting to the set of L0- and T-junctions
(Fig. 4, pre-D). While the voting fields for L0-junctions are defined as described
before, T-junctions send out semi-lobes of a stick field in the direction of their
stem. The resulting curve saliency map is fed to a marching process which traces
the locally maximal curve saliencies in order to yield the virtual contours
(Fig. 4, D).




Fig. 4. Virtual contours: pre-D1) Curve saliency map resulting from the second
junction voting step applied to the shape of a seven as input. D1) Extracted virtual
contours (gray) overlaid on the previously extracted real contours (white).
pre-D2), D2) Results for another example with the shape of a three as input.

2.4 Deriving a Shape Description

In order to introduce a unit formation process yielding multi-layered shape
descriptions, we first have to take a closer look at the different kinds of
junctions and the information they contain with respect to the indicated shape
decompositions. A classification of the junction types is made by inspecting the
L0-junctions after the computation of virtual contours and relabeling them to
L1-junctions in case of one incoming virtual contour in addition to the two real
contours, or to L2-junctions in case of two incoming virtual contours.
Subsequently, a T-junction will be relabeled to T1 if it has been induced by an
L1-junction or to T2 if induced by an L2-junction.

Among these types, an L0-junction does not contain any information about
a decomposition. It merely represents a discontinuity on the contour.

An L1-junction indicates a shape decomposition along the virtual contour
yielding two disjoint parts¹. The virtual contour is included in the outline of
both parts. However, its polarity vector in the decomposition points in two
opposite directions depending on the part to which it has been assigned.

Similarly, each virtual contour of an L2-junction potentially belongs to the
outline of two disjoint parts, i.e. creates decompositions with polarities in both
directions (Fig. 5a). However, it is also possible that both virtual contours
represent continuations of two overlapping parts (Fig. 5b). In that case only one
direction of each polarity vector pair will be used in the decomposition. The
two cases are discriminated by the walking algorithm described below.






Fig. 5. L2-junction: a) Each virtual contour belongs to two disjoint parts (both
potential polarity vectors occur in the decomposition), b) Virtual contours belong to
different overlapping parts (only one of each polarity vector pair is used).



As depicted in Fig. 6, the two types of T-junctions carry different information
for the suggested decompositions: While the stem of a T1-junction belongs to
disjoint parts and thus bears opposite polarity assignments, the stem of a
T2-junction represents an occluded contour created by two overlapping parts and
possesses only one polarity direction in the decomposition, which corresponds
to the contour polarity at the adjacent L2-junction. However, the corresponding
half of the T2-bar belongs to two parts, both times with the same region polarity.



¹ Note that the interacting junction does not necessarily have to be an
L1-junction. It could also be an L2- or a T-junction.

To complete this junction catalog, we finally mention the possibility of an
L2-junction which is created merely by two virtual contours. However, we will
postpone the discussion of this junction type because it does not initiate any
decompositions. It rather ensures correct continuations of straight lines by
preserving the discontinuity at the occluded corner.

So far all computations have been local operations. As we intend to derive de- 
compositions of the input form, we need to introduce a mechanism which collects 
this information to achieve a globally consistent description. For that purpose, 
we first build an adjacency graph for the junctions, label the junctions by the 
types mentioned above and assign the potential polarity vectors to the contours 
meeting at these junctions. It is sufficient to consider the L1- and L2-junctions
as seeds for possible decompositions (other types do not induce decompositions
or cannot occur independently).

Then, for each seed we subsequently start outgoing “walks” in one of the 
unvisited directions. This process is based on a cursor, called walker, which 
stores the current position on the contour together with the walking direction 
and the region polarity. The walker is advanced from the current junction to the 
adjacent junction in walking direction and the chosen polarity vectors along the 
traversal are marked as visited. At this junction the outgoing continuation of the 
walk has to be inferred from the incoming direction and polarity. The decision
is based on a set of predefined rules, examples of which are depicted in Fig. 7.
In general, a continuation tries to preserve direction and polarity. If such a straight
continuation is not possible, characteristics of the current junction type together 
with the information about already visited contour parts determine a change in 
direction and an adaptation of the polarity vector. 

A walk is stopped on arriving back at the starting junction or in case of no
possible continuation. However, a closed loop is only regarded as a successful
solution if the continuation of the walker returning to the start yields the same
polarity and direction as the initially outgoing walk, i.e. the walk would be
cyclic. Such a walk indicates a successfully decomposed part. Otherwise, the
unsuccessful walk is undone by restoring all visited polarity vectors.

Fig. 7. Rules how to continue walks at junctions: The examples show the rules for
L2-junctions (first row) and T2-junctions (second row). Panel labels: remaining
polarities (one or both) after one straight traversal; the same polarity twice on
the shared side; remaining polarities before traversal and after traversal of the
top bar.

The walking process is repeated for all remaining polarities along virtual 
contours until no more decompositions are found. It terminates because the 
number of potential polarity vectors is reduced for each part decomposed and 
there is always one outgoing walk among the seeds which yields a successful part 
decomposition. 

Some results are illustrated by the examples of Fig. 8. It shows for each shape
that the junctions, including the virtual T-junctions, are extracted correctly
and labeled according to the junction types introduced here. Finally, the walking
algorithm yields a decomposition into overlapping parts as proposed for the
generation of an adequate shape description of the input images.

3 Conclusions and Perspectives 

We have introduced a framework for a unit formation process based on the 
local computation of perceptual groupings by tensor voting and a decomposition 
scheme which collects polarity and directional information to extract overlapping 
parts. Although this last step seems to be iterative and serial, it actually could 
be implemented as a highly parallel process where successful decompositions 
are highlighted by closed walks creating positive feedback loops or synchronized 
cell activities. This approach would be similar to the particle system model of 
| [ITKn+c)9| but with a considerably lower number of iterations and the ability to 
create layered representations. 

Fig. 8. Resulting part decompositions (F) based on the detected junctions (E),
depicted for two input images, the shapes of a seven and of a three. Junctions are
illustrated by the incoming directions (long arrows) and the initially possible
polarity vectors (small arrows). Hence, the junction type L0 is formed by two, L2
by four, and T2 by three long arrows. Parts of the decomposition are depicted in a
3-D display as overlapping layers of different heights where the arrows along the
contour trace the positions of the walker.

In agreement with psychophysical findings, we explicitly do not make use of
depth information for the inference of units. We rather intend to obtain depth
information from the unit formation process. Depth arrangements for
spontaneously splitting figures from binary input images can be deduced from the
length of the virtual contours. Parts with short virtual contours, reflected by
higher saliencies, are assumed to lie on top of parts with longer virtual
contours. However, further research is needed in order to handle ambiguous
orderings where the virtual contours of one part have different lengths or those
of two overlapping parts have similar lengths. This suggests the introduction of
bistable depth orderings, as are known to occur in psychophysical experiments, too.
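
The ordering heuristic can be written down directly. The sketch below assumes
that each part is given together with the lengths of its virtual contours and,
as a hypothetical simplification not made in the text, ranks parts by their mean
virtual contour length; the ambiguous cases mentioned above are not resolved
by it.

```python
def depth_order(parts):
    """Return part names sorted from the top layer to the bottom layer.

    `parts` maps a part name to the non-empty list of lengths of its virtual
    contours. Shorter virtual contours are taken to indicate a part that lies
    on top. Using the mean length is an assumption made for this sketch.
    """
    def mean_length(item):
        _, lengths = item
        if not lengths:
            raise ValueError("each part needs at least one virtual contour")
        return sum(lengths) / len(lengths)

    return [name for name, _ in sorted(parts.items(), key=mean_length)]

# Hypothetical measurements for a two-part decomposition.
order = depth_order({"stem": [12.0], "bar": [4.0, 5.0]})
```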

While the unit formation process should be universally applicable to other 
input domains with additional attributes like color and texture, the influence 
of this information on different depth orderings requires further investigations. 
Nevertheless, the units created solely depend on the form, hence should be the 
same for all domains as for binary inputs. 

Furthermore, we intend to generalize the implementation from polygonal
shapes to rounded forms. This is already facilitated by the fact that the tensor
voting scheme detects a saliency for “cornerness” and thus allows
junctions on rounded shapes to be handled as corners with lower saliency.

Currently, only one fixed scale parameter is used by the tensor voting between 
junctions. As long as the size of the voting fields is large enough to bridge the 
biggest gaps, the scale parameter is not critical for the outcome. In contrast to 
other approaches with continuous scale spaces, we think a small number of scales 
will be sufficient to get descriptions on different levels of abstraction. In addition, 
we plan to extend the voting scheme to an adaptive scale parameter proportional
to the length of the line ending at a junction. This can also be realized by the
tensor voting approach, which allows a degree of “endpointness” to be defined at
the line endings. Thus long lines are enabled to bridge large gaps while short
lines are restricted to a smaller radius to find a continuation.

Finally, the application of the unit formation to real input images seems to
be a very challenging goal. However, the restriction to such a simple domain as
presented here helps to focus on the fundamental properties of the unit formation
process which are still open to a wide range of investigations.


Abstract. In this paper we present a new class of algorithms for detecting
lines in digital images. The approach is based on a general formulation
of a combinatorial optimization problem. It aims at estimating
piecewise linear models. A linear system is constructed with the coordinates
of all contour points in the image as coefficients and the line
parameters as unknowns. The resulting linear system is then partitioned
into a close-to-minimum number of consistent subsystems using a greedy
strategy based on a thermal variant of the perceptron algorithm. The
partition into consistent subsystems yields the classification of the
corresponding image points into a close-to-minimum number of lines. A
comparison with the standard Hough Transform and the Randomized
Hough Transform shows the considerable advantages of our combinatorial
optimization approach in terms of memory requirements, time complexity,
robustness with respect to noise, possibility of introducing “a
priori” knowledge, and quality of the solutions regardless of the
algorithm parameter settings.



1 Introduction 

The Hough Transform (HT) and its numerous variants are the classical
approaches used to detect and recognize straight lines in digital images [1], [2].
The various HT variants have been developed to try to overcome the major
drawbacks of the standard HT, namely, its high time complexity and large memory
requirements. In some cases, variants such as the randomized, probabilistic and
hierarchical HT [7], [8] achieve an effective complexity reduction. In others,
however, they face serious difficulties and fail to provide solutions of the
desired quality. This happens, for instance, when several line segments need to
be detected simultaneously or when there is a relatively high level of noise
[1], [2]. In general, selecting small values for the thresholds of those HT
variants may yield erroneous solutions, while selecting larger values may
substantially increase the computational load and therefore jeopardize their
attractive features of reduced time complexity and lower memory requirements.
Thus, in the presence of several lines to be detected, one has to find a delicate
trade-off between time/memory requirements and quality of solutions. In this
paper we present a new approach for detecting lines in digital images that
differs from the HT-based ones. The basic idea is to formulate the problem in the
same way as the one we introduced in [3] for estimating general piecewise linear
models, namely as the combinatorial optimization problem of partitioning an
inconsistent linear system into a close-to-minimum number of consistent
subsystems. The method we devise to find approximate solutions to this problem
formulation provides results of equivalent or even higher quality than the HT,
and it compares very favorably in terms of time and memory requirements as well
as robustness. The paper is organized as follows: Section 2 describes the
combinatorial optimization formulation, some of its properties, and a greedy
strategy to find good approximate solutions in a short amount of time. Some
details of the algorithms as well as their convergence and projection strategy
are presented in Section 3, while some possible optimizations are presented in
Section 4. Some typical results are reported in Section 5 and compared with those
provided by the basic and randomized HT. The paper is concluded by some general
remarks and perspectives.



2 The MIN-PCS Based Formulation of Line Detection 

The problem of classifying the points of an image into line segments can be
formulated as that of partitioning an inconsistent linear system into consistent
subsystems. Indeed, the coordinates of points belonging to a line segment
satisfy a simple linear system whose solution corresponds to the parameters of
the line. In the presence of several lines and noise distributed in the image, the
linear system is inconsistent, i.e. there exists no solution satisfying the
equations corresponding to all image points. In such cases, regression and robust
regression based approaches are clearly inadequate. The breakdown point of
classical robust regression methods, for instance, limits their applicability to
a very restricted type of situation. In particular, there must be a dominant
subsystem that accounts for at least 50% of the equations, the breakdown limit
[6]; for other approaches [10], the solution is guaranteed only for uniform or
“a priori” known noise distributions. In the case of inconsistent systems
corresponding to several “unknown” consistent subsystems and noise with “unknown”
distributions, other approaches have to be found. For these reasons, alternative
approaches generally based on the HT and its numerous variants have been
extensively investigated [1], [2], [7]. Although, in general, these approaches
tend to provide reasonable results and to be relatively robust with respect to
noise, they have high time and memory requirements and they are quite sensitive
with respect to the threshold settings.
In this paper we show that accurate solutions to the problem of line detection
can be found by considering the following combinatorial optimization problem
that we have introduced in [3].

MIN PCS: Given an inconsistent linear system Ax = b, where A is an m × n
matrix, x is an n-dimensional vector and b is an m-dimensional vector, find a
Partition of the set of equations into a MINimum number of Consistent Subsystems.

In the case of line detection, the coefficients of each row of the inconsistent linear
system correspond to the coordinates of one of the contour points at hand. Any
partition into a number s of consistent subsystems is then clearly equivalent to
a partition of all contour points into s line segments. Given the choice of the
objective function, we look for the simplest set (i.e. the smallest number) of line
segments that account for all contour points. To cope with noise and quantization
or acquisition errors, it suffices to replace each equation $a_k x = b_k$, where
$a_k$ is the k-th row of A and $b_k$ is the k-th component of b, by the two
complementary inequalities:

\[
\begin{cases}
a_k x \le b_k + \varepsilon \\
a_k x \ge b_k - \varepsilon
\end{cases}
\tag{1}
\]

where $\varepsilon$ is the maximum tolerable error. See [3] for the description of a
simple geometric interpretation of this version of MIN PCS. In the present setting,
it amounts to finding a minimum number of hyperslabs in n dimensions whose
width is proportional to $\varepsilon$ and such that each point corresponding to the
coordinates of one contour point is contained in at least one hyperslab.
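
As a concrete reading of the tolerance, the following sketch lists the equations
that a candidate solution x satisfies within the maximum tolerable error. The
slope-intercept parameterization used in the usage lines (row (x_k, 1),
right-hand side y_k, unknowns (m, q)) is an assumption made for illustration
only; it cannot represent vertical lines.

```python
def consistent_indices(rows, rhs, x, eps):
    """Indices k for which |a_k . x - b_k| <= eps.

    This is equivalent to both complementary inequalities holding for
    equation k, i.e. the point lies inside the slab of half-width eps.
    """
    result = []
    for k, (a_k, b_k) in enumerate(zip(rows, rhs)):
        residual = sum(a * xi for a, xi in zip(a_k, x)) - b_k
        if abs(residual) <= eps:
            result.append(k)
    return result

# Four contour points; the first three lie (almost) on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.2), (3.0, 0.0)]
rows = [(px, 1.0) for px, _ in points]   # assumed row layout (x_k, 1)
rhs = [py for _, py in points]           # assumed right-hand side y_k
inliers = consistent_indices(rows, rhs, (2.0, 1.0), eps=0.25)
```

A consistent subsystem in the sense of MIN PCS is then a set of equations for
which some x makes this list cover the whole set.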

Although we proved in [3], [4] that MIN PCS is an NP-hard problem, and 
hence, it is unlikely that any algorithm is guaranteed to find an optimal so- 
lution in polynomial time, we have developed a heuristic which works well in 
practice and finds good approximate solutions in a short amount of time. The 
results obtained for other problems and time series modeling clearly confirm this
assertion [3], [4]. Since in practical applications we are interested in finding
close-to-minimum partitions rapidly, we adopt a greedy strategy in which the
original problem is subdivided into a sequence of smaller subproblems. Several
projection schemes can be used for the solution of each subproblem depending on
its nature. Here we introduce one scheme designed for line detection (i.e. MIN-PCS
problems with two variables). Starting with the original inconsistent system of 
pairs of inequalities (1), close-to-maximum consistent subsystems are extracted 
iteratively. Clearly the iterative extraction of consistent subsystems yields the 
partition into consistent subsystems. The extraction of close-to-maximum con- 
sistent subsystems is performed by using a thermal variant of the perceptron 
procedure that originally comes from machine learning (see [4], [5], [6] and the 
references therein). The algorithm can be described as follows, see also [3]: 

Problem: Given any system Ax = b and any maximum admissible error $\varepsilon > 0$,
look for an $x_{max} \in \mathbb{R}^n$ such that the pair of complementary inequalities
$a_k x_{max} \le b_k + \varepsilon$ and $a_k x_{max} \ge b_k - \varepsilon$ is satisfied
for the maximum number of indices $k \in \{1, \ldots, p\}$.

- Initialization: Take an arbitrary $x_0 \in \mathbb{R}^n$, set c := 0 and the initial
  temperature t := t_0, select a predefined number of cycles C as well as a function
  γ(c, C) decreasing for increasing c and such that γ(C, C) = 0.

begin
    i ← 0;
    repeat
        c ← c + 1; t ← t_0 · γ(c, C); S ← {1, ..., p};
        while S ≠ ∅ do
            pick s ∈ S randomly and remove s from S;
            k_i ← s; compute the deviation of equation k_i at x_i and the
                temperature-dependent step size δ_i;
            if (a_{k_i} x_i < b_{k_i} − ε) x_{i+1} := x_i + δ_i a_{k_i}
            else if (a_{k_i} x_i > b_{k_i} + ε) x_{i+1} := x_i − δ_i a_{k_i};
            i ← i + 1;
            project the current solution onto the unit cylinder
    while c < C
    take the final x_i as an estimate of x_max
end



where t_0 is determined by the average deviation from consistency (average
inequality error) for the current solution x_i at the beginning of each cycle.
Intuitively, the behavior of the algorithm can be explained as follows. At high
normalized temperature t/t_0, all equations, with either high or low deviations
from consistency, lead to a significant correction of the current solution x_i.
Conversely, at low temperatures, only those equations with small deviations from
consistency yield relevant corrections to the current solution. Convergence of the
procedure is guaranteed because when t is decreased to zero, the amplitude of
the corrections tends to zero. In our experiments we have used exponentially
decreasing functions for t, from an initial t_0 to 0 in a predefined maximum
number of cycles C through the equation set. See [3], [4] for more details on the
algorithm and the annealing schedules.
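
The qualitative behavior described above can be checked numerically. The step
size of the procedure is not legible in this copy, so the sketch assumes the
factor (t/t0) exp(-|E|/t) that is customary for thermal perceptron rules, where
E is the deviation of an equation from consistency. The function names are
hypothetical.

```python
import math

def correction_factor(deviation, t, t0):
    """Assumed thermal step size (t / t0) * exp(-|deviation| / t), for t > 0."""
    if t <= 0.0 or t0 <= 0.0:
        raise ValueError("temperatures must be positive")
    return (t / t0) * math.exp(-abs(deviation) / t)

def selectivity(t, t0, small=0.1, large=2.0):
    """Ratio of the corrections caused by a small and by a large deviation."""
    return correction_factor(small, t, t0) / correction_factor(large, t, t0)

high_temperature = selectivity(t=1.0, t0=1.0)   # corrections of similar order
low_temperature = selectivity(t=0.1, t0=1.0)    # small deviations dominate
```

Under this assumption the ratio grows from exp(1.9) at t = t0 to exp(19) at
t = 0.1 t0, and the overall amplitude of the corrections shrinks with t, which
matches the description of the annealing.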



3 Convergence Analysis of MIN-PCS Based Line 
Detection Algorithms 

In this section we study the convergence behavior of the proposed algorithms. To
clarify the convergence issues, let us consider a linear system composed
of only two equations in two variables:

\[
\begin{cases}
a_{11} x_1 + a_{12} x_2 = b_1 \\
a_{21} x_1 + a_{22} x_2 = b_2
\end{cases}
\tag{2}
\]

These two equations represent two straight lines in $\mathbb{R}^2$. Assuming that the
two lines are not parallel, the relaxation algorithm described before can be
illustrated as in Figure 1 (see also [3], [6]). The angle $\alpha$ between the two
lines and the convergence rate are related as follows:

\[
\| x_{i+2n} - x^{*} \| = \cos^{2n}\alpha \; \| x_i - x^{*} \|
\tag{3}
\]

where $x^{*}$ is the intersection of the two lines and $x_i$ is a current solution
lying on one of them.

Fig. 1. Left: graphical representation of the relaxation algorithms to extract
close-to-maximum consistent subsystems, center: example of convergence for the case
of a consistent system, right: example for a case of an inconsistent system, i.e.
two consistent subsystems.



It is clear that when the two lines are nearly parallel, i.e. when $\alpha$ is
close to 0, the algorithm converges very slowly (since $\cos\alpha$ is close to 1).
These situations occur very frequently in line detection problems: it can easily
be shown that this is the case for relatively short image segments located far
from the (conventional) coordinate origin in the binary image. Conversely, when
$\alpha$ is close to $\pi/2$, the algorithm converges very quickly. In practice,
one should try to make the row vectors in the system of linear equations mutually
orthogonal by using classical techniques (see for instance [9]). Unfortunately,
this kind of orthogonalization procedure cannot be applied in the case of
inconsistent systems (i.e. several consistent subsystems) because the “line
pencil” is not unique. In fact, there are several consistent subsystems
corresponding to different lines, as illustrated in Figure 1.
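The dependence of the convergence speed on the angle can be checked numerically.
The following is a minimal sketch, not the authors' implementation: it assumes
plain orthogonal projections (no temperature-dependent amplitude), and the
function names are illustrative. For two lines meeting at angle $\alpha$, each
projection shrinks the distance to the intersection point by $|\cos\alpha|$.

```python
import numpy as np

def project_onto_line(p, a, b):
    """Orthogonally project point p onto the line {x : a . x = b}."""
    a = np.asarray(a, dtype=float)
    return p - (a @ p - b) / (a @ a) * a

def alternating_projections(a1, b1, a2, b2, x0, n_steps):
    """Alternate projections onto two non-parallel lines.

    Returns the distance to the intersection point after each projection.
    """
    A = np.array([a1, a2], dtype=float)
    x_star = np.linalg.solve(A, np.array([b1, b2], dtype=float))
    x = np.asarray(x0, dtype=float)
    dists = []
    for k in range(n_steps):
        a, b = (a1, b1) if k % 2 == 0 else (a2, b2)
        x = project_onto_line(x, a, b)
        dists.append(np.linalg.norm(x - x_star))
    return np.array(dists)

if __name__ == "__main__":
    for alpha in (0.05, 0.5, 1.5):
        # Line 1 is the x axis; line 2 passes through the origin at angle alpha.
        normal2 = (-np.sin(alpha), np.cos(alpha))
        d = alternating_projections((0.0, 1.0), 0.0, normal2, 0.0, (1.0, 1.0), 8)
        print(alpha, d[1:] / d[:-1], np.cos(alpha))
```

Small angles give a contraction factor close to 1 per projection, which is the
slow-convergence regime described above.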

So as to devise a fast line detection algorithm, we have to guarantee fast
convergence for all possible input data (i.e. line positions in the image). The
idea is to find a suitable surface on which to project the current solution
$x_i$, one that is as orthogonal as possible to all possible lines, so as to
speed up the convergence and to constrain the solution to a desired region of the
space. With this objective the working space (parameter space) has been extended
to a three-dimensional space. Each point $X = (x, y)$ in $\mathbb{R}^2$ (image
space) is mapped into a plane in the three-dimensional parameter space
$\mathbb{R}^3$ according to the linear equation $ax + by + c = 0$. Then each line
to be extracted in $\mathbb{R}^2$, defined by
$\{(x, y) \in \mathbb{R}^2 \mid a_i x + b_i y + c_i = 0\}$, corresponds to a line
in $\mathbb{R}^3$ defined by
$\{(a, b, c) \in \mathbb{R}^3 \mid \exists \gamma \in \mathbb{R},\ (a, b, c) = \gamma (a_i, b_i, c_i)\}$.
Actually, in the case of an inconsistent linear system we have several “plane
pencils”, one for each straight line in the image, and the planes of each pencil
intersect in a line (all these lines contain the origin since all the linear
equations are homogeneous). Thus, the problem we have to solve is that in three
dimensions each solution line contains the origin (i.e. $(a, b, c) = (0, 0, 0)$),
so that the algorithm always converges to the trivial solution 0.

Fig. 2. 3-D representation of the projection scheme in the case of a consistent
system (3 aligned points in the image correspond to the pencil defined by planes
P1, P2 and P3).

To avoid this occurrence, at each solution update (correction) we perform a
projection onto the unit cylinder of equation
$\{(a, b, c) \in \mathbb{R}^3 \mid \sqrt{a^2 + b^2} = 1\}$. For each image point
we alternate a projection onto the corresponding solution plane and one onto the
unit cylinder. Hence, this procedure constrains the current iterative solution to
remain close to the unit cylinder and its intersecting planes. Specifically, the
following non-orthogonal projection is performed:



a \leftarrow \frac{a}{\sqrt{a^2 + b^2}}, \qquad b \leftarrow \frac{b}{\sqrt{a^2 + b^2}}, \qquad c \leftarrow \frac{c}{\sqrt{a^2 + b^2}} \qquad (4)
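The alternation between the two projections can be sketched as follows. This is
an illustrative simplification, not the authors' code: it assumes a consistent
system (all points on one line), uses full orthogonal projections onto each
plane instead of temperature-weighted corrections, and the function names are
hypothetical.

```python
import numpy as np

def project_to_plane(w, point):
    """Orthogonal projection of w = (a, b, c) onto the plane a*x + b*y + c = 0
    that the image point (x, y) defines in parameter space."""
    x, y = point
    n = np.array([x, y, 1.0])          # normal of that plane
    return w - (n @ w) / (n @ n) * n

def project_to_cylinder(w):
    """Rescale (a, b, c) so that sqrt(a^2 + b^2) = 1, as in eq. (4).
    Undefined when a = b = 0."""
    return w / np.hypot(w[0], w[1])

def fit_single_line(points, w0, n_cycles=200):
    """Cycle through the points, alternating plane and cylinder projections."""
    w = project_to_cylinder(np.asarray(w0, dtype=float))
    for _ in range(n_cycles):
        for p in points:
            w = project_to_cylinder(project_to_plane(w, p))
    return w

if __name__ == "__main__":
    pts = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # points on y = 2x + 1
    print(fit_single_line(pts, (1.0, 0.0, 0.0)))
```

Because the rescaling never sends the iterate to the origin, the trivial
solution is avoided while the direction of (a, b, c) still converges to the
common line of the plane pencil.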



Applying this procedure the speed of convergence can now be expressed as: 



\cos\alpha = \frac{n_i \cdot n_{i+1}}{\|n_i\| \, \|n_{i+1}\|} \qquad (5)

with $n_i$ and $n_{i+1}$ the normal vectors corresponding to two consecutive planes. A
rigorous study of the convergence speed in the general case of an inconsistent 
system is obviously much more complex. This type of analysis must also take 
into account the statistical behavior of the consistent subsystems distributions 
and of the corrections versus the temperature scheme. 



4 Line Detection Algorithms for Shape Detection and 
Tracking 

In the basic scheme described in the previous sections, consistent subsystems
(i.e. lines in an image) are extracted iteratively by the MIN-PCS greedy
strategy, starting from a random initial solution and letting the relaxation
algorithm converge. This is the most generic approach to cope with applications
for which no “a priori” knowledge is available (number of lines to be extracted,
probable positions, object shape, geometrical relation between lines, etc.). In
many applications, additional information such as object shape, approximate
position, number of lines to extract, fixed angle (or distance) between lines,
etc. might indeed be available. The nature of the approach allows the addition of
various types of “a priori” information that can dramatically improve robustness
and performance. This “a priori” information can be easily embedded into the
kernel of the MIN-PCS based algorithms, while classical approaches have to
consider it in the post-processing stage.

Multi Solutions Algorithms. An example of such inclusion of “a priori”
information is the following. If the number of lines is known (or at least the
minimum number), several solutions can be combined in the same temperature
annealing scheme, saving the computation of the temperature decrease scheme
and the associated post-processing stage for each line. There are two possible
options to perform the correction:

- For each geometrical point randomly picked in the image (in the inner loop),
every solution is updated. This is the simplest implementation, but because
the solutions are independent, some of them could merge and converge to the same
value. This option is powerful if a good approximation of the solution is known
(for instance in tracking applications).

- Only the closest solution is updated. This option avoids the possible merging
of solutions.
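The second option can be sketched as below. This is a simplified illustration
under stated assumptions, not the authors' algorithm: the temperature schedule
is left out, each correction is a full projection onto the point's plane
followed by the rescaling to the unit cylinder, and all names are hypothetical.

```python
import numpy as np

def update(w, point):
    """Project w = (a, b, c) onto the plane of the image point, then rescale
    so that sqrt(a^2 + b^2) = 1."""
    x, y = point
    n = np.array([x, y, 1.0])
    w = w - (n @ w) / (n @ n) * n
    return w / np.hypot(w[0], w[1])

def closest_solution(solutions, point):
    """Index of the solution whose line is nearest to the point.
    With a^2 + b^2 = 1, |a*x + b*y + c| is the point-to-line distance."""
    x, y = point
    dists = [abs(w[0] * x + w[1] * y + w[2]) for w in solutions]
    return int(np.argmin(dists))

def multi_line_fit(points, initial, n_cycles=100, seed=0):
    """Update only the closest solution for each randomly picked point."""
    rng = np.random.default_rng(seed)
    sols = [np.asarray(w, dtype=float) / np.hypot(w[0], w[1]) for w in initial]
    pts = np.asarray(points, dtype=float)
    for _ in range(n_cycles):
        for i in rng.permutation(len(pts)):
            k = closest_solution(sols, pts[i])
            sols[k] = update(sols[k], pts[i])
    return sols

if __name__ == "__main__":
    pts = [(x, 0.0) for x in range(4)] + [(x, 5.0) for x in range(4)]
    start = [(0.0, 1.0, -0.5), (0.0, 1.0, -4.5)]   # rough guesses: y = 0.5, y = 4.5
    for w in multi_line_fit(pts, start):
        print(w)
```

Starting from rough guesses, as in a tracking setting, each point only ever
corrects the line it is nearest to, so the two solutions do not merge.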

Geometrical Constraints. The projection scheme introduced previously yields
directly the Hough parameters of the line (solutions are constrained to the unit
cylinder) as follows:

\begin{cases} \cos(\theta_h) = s_c \cdot a \\ \sin(\theta_h) = s_c \cdot b \\ d_h = s_c \cdot c = |c| \end{cases} \quad \text{with} \quad s_c = \begin{cases} 1 & \text{if } c \ge 0 \\ -1 & \text{if } c < 0 \end{cases} \qquad (6)

with $\theta_h$ and $d_h$ being the Hough parameters of the line. This approach
allows an easy implementation of additional geometrical constraints into the
inner loop.
For instance, in the case of two simultaneous lines, the second line can be forced
to be parallel to the first one by imposing that:

- Only the nearest solution is projected,

- The other is updated so as to be parallel (just a copy of the a and b parameters).

The same strategy can be applied to more than two parallel lines, to perpendicular
lines, or to lines at any given angle.
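As a small illustration of eq. (6) and of the parallelism constraint, the
following sketch converts a solution on the unit cylinder to Hough parameters
and copies the direction of one line onto another. The function names are
hypothetical, and the sign convention is the one of eq. (6) as written, i.e.
the distance is $s_c \cdot c = |c|$.

```python
import math

def hough_parameters(a, b, c):
    """Return (theta_h, d_h) for a solution (a, b, c) with a^2 + b^2 = 1."""
    s = 1.0 if c >= 0 else -1.0
    theta = math.atan2(s * b, s * a)   # cos(theta) = s*a, sin(theta) = s*b
    d = s * c                          # equals |c|
    return theta, d

def make_parallel(reference, other):
    """Force `other` to be parallel to `reference`: copy a and b, keep c."""
    a, b, _ = reference
    return (a, b, other[2])

if __name__ == "__main__":
    line1 = (0.6, 0.8, -2.0)
    line2 = (1.0, 0.0, 3.0)
    print(hough_parameters(*line1))
    print(make_parallel(line1, line2))
```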

Although the general strategy can always be used with very good results, a
correction strategy suited to the application at hand can considerably improve
the overall performance. For instance, when MIN-PCS methods are used for initial
feature detection, it is only important that the temperature is high enough,
while the choice of the initial random solution is irrelevant. The only drawback
is the need for more computation to let the algorithm converge. Conversely, for
tracking applications, the “a priori” information available from the past images
(i.e. the previous position of the features we are interested in) can be used to
dramatically reduce the computational load. The initial solution is not chosen
randomly, but corresponds to one of the probable solutions, and the initial
temperature is set to a lower value compatible with the maximum admissible
distance from the past solution. If the current solution is very near to the past
solution, the algorithm converges very rapidly; if not, the algorithm is capable
of finding the new one within the admissible variation range, thus still saving
computation compared to the general detection algorithm not using any “a priori”
knowledge.

5 Some Simulation Results 

The generic MIN-PCS algorithm has been applied to some synthetic and natural
test images with different levels of noise. Fig. 3 reports an example of results
for images without noise and with 2% of the total image, or equivalently 118%
of the contour points, as randomly distributed noise. The same images have been
processed by the classical HT and by the randomized HT (RHT) [1], [7] with a
256 × 256 accumulator array, equal to the image resolution. Table 1 summarizes
the results of a profiling analysis of the HT, RHT and MIN-PCS algorithms on a
SUN UltraSparc WS. All results are normalized to 3.57 seconds. The lower part
of the table indicates the (subjective) quality of the obtained results.

As confirmed by all our experiments, the MIN-PCS approach provides the
highest quality results in all noise conditions. Moreover, the time requirements
are much lower than those of the HT for low levels of noise and comparable for
higher noise levels, while yielding higher quality results. At high noise levels
no results can be obtained with the HT. RHT shows its limits and fails to provide
any results even for medium levels of noise. It has to be pointed out that the
comparison would be much more favorable for larger image sizes. For instance,
considering the same image at double resolution (512 × 512 pixels instead of
256 × 256), HT memory requirements and processing time increase by a factor of
4, while the time and space complexity of our algorithm only increases with the
number of contour points and the number of subsystems. A more detailed analysis
of the memory and complexity of MIN-PCS algorithms versus the HT is omitted
here for brevity. Another result that demonstrates the excellent performance and
robustness to noise of the MIN-PCS based algorithm is reported in Fig. 4.
The image contains 500 points obtained by adding a Gaussian noise of $\sigma = 10$



Table 1. Comparison of speed (in seconds) and quality of results obtained with HT,
RHT and MIN-PCS algorithms for the image of Fig. {house} at different levels of noise.
(*) The noise % refers to total image points (256 x 256). (**) The noise % w.r.t. total
contour points. [++] all information is correctly detected, [+] all information can be
extracted by a simple post-processing stage, [-] information is missing, [/] no results
can be obtained.

Noise % (*)     0      1      2      5      10     15
Noise % (**)    0      59     118    295    590    885

HT              1      1.16   1.29   1.77   4.79   4.76
RHT             0.08   0.13   0.14   0.46   3.30   4.01
MIN-PCS         0.06   0.49   0.93   2.19   11.2   28.0

HT              ++     ++     ++     -      -      /
RHT             ++     ++     +      -      /      /
MIN-PCS         ++     ++     ++     +      +      -
Fig. 3. From left to right: original gray scaled test image, original binary image (after
basic edge detection), results of MIN-PCS based line detection, results with 295% of
the points as random noise (i.e. 5% of the total number of points).

Fig. 4. From left to right: MIN-PCS approach (original image: two lines whose points
have been displaced by a quantity with a Gaussian distribution), MIN-PCS with 250%
of additional random noise (w.r.t. the points of the original image), HT with 250% of
additional noise, RHT with 250% of additional noise.

Fig. 5. Left: synthetic image composed of 10 randomly distributed lines. Right:
MIN-PCS solution when 50% of the points are original image points and 50% are noise
(randomly distributed points).



pixels to two original segments. For various noise levels, the MIN-PCS approach
always recovers the two original segments with 3.65 seconds of processing time.
As shown in Fig. 4 (right), RHT never provides a correct result, and 65% of the
segments determined by the HT (in 6.25 seconds) are not correctly grouped or
do not have the correct parameters.

Another example of results is reported in Fig. 5. These tests are based on
images containing 10 randomly distributed segments with additional speckle
noise corresponding to 50% of the line points. The MIN-PCS approach always
provides the correct results. Since on average each subsystem contains 3-5% of
the total points, no method based on robust regression techniques can be applied.
6 Conclusion

We have presented a class of algorithms, based on a general combinatorial
optimization formulation, able to detect lines in images. The problem is
formulated as that of partitioning inconsistent linear systems into a minimum
number of consistent subsystems. The linear system is obtained from the contour
lines extracted from the images. A generic algorithm, as well as possible
variants able to include “a priori” information available in some applications,
have been described. A projection strategy that avoids critically slow
convergence for some data distributions (short segments located far from the
conventional origin of the reference system) has also been developed. The
MIN-PCS approach can be applied to a variety of other problems for which
piecewise linear models are valuable. Since higher degree polynomials can be
viewed as linear functions with respect to their coefficients, the approach can
also be extended to the estimation of piecewise polynomial models with submodels
of bounded degree, and thus to the detection of shapes other than lines.
Abstract. Koenderink characterizes the local shape of 2D surfaces in
3D in terms of the shape index and the local curvedness. The index
characterizes the local type of surface point: concave, hyperbolic, or
convex. The curvedness expresses how articulated the local shape is,
from flat towards very peaked. In this paper we define corner points
as points on a shape of locally maximal Koenderink curvedness. These
points can be detected very robustly based on integration indices. This
is not the case for other natural corner points like extremal points.
Umbilici can likewise be detected robustly by integral expressions, but
do not correspond to intuitive corners of a shape. Furthermore, we
show that Koenderink corner points do not generically coincide with
other well-known shape features such as umbilici, ridges, parabolic lines,
sub-parabolic lines, or extremal points. This is formalized through the
co-dimension of intersection of the different structures.

Keywords: Shape features, shape evolution, differential geometry, sin- 
gularity theory 

1 Introduction 

A shape is, in our context, a 2D surface embedded in 3D space: the boundary
of a physical object. Problems like geometrical alignment (registration),
interpolation, and recognition of slightly deformed shapes are far from trivial.
The local properties of a shape may be used for defining local shape features
such as corners, edges, cylindrical (parabolic) lines, sub-parabolic lines, etc.
These may then be used for geometrical alignment and interpolation, whereas
their topological properties may be used for creating graph representations of
shapes, in turn used for recognition and indexing.

A visual shape is defined through measurements. These may be optical or
range images of physical objects, or density measurements in volumetric (medical)
images. A well-founded and robust way to formalize measurements is through
the concept of Gaussian scale-space theory. In this case, the shape will
generically be differentiable owing to the differentiability of the Gaussian
kernel. Hence, in this paper we examine analytically parametrized shapes:

S : \mathbb{R}^2 \to \mathbb{R}^3, \qquad S = (x, y, z),

where x, y, z are analytical functions.



C. Arcelli et al. (Eds.): IWVF4, LNCS 2059, pp. 420-EMl 2001. 
© Springer- Verlag Berlin Heidelberg 2001 



Koenderink Corner Points 






In this paper, the canonical computational example is surfaces defined as
iso-intensity surfaces in a 3D volumetric (medical) image. Generically,
iso-surfaces of an analytical (scale-space) image are analytical surfaces.
However, generically we will also see non-analytical surfaces for isolated
intensity values. We will in this paper also touch upon the shape properties in
transitions through these isolated cases.

In the next section, we review well-known local shape features, and in this
context we define Koenderink corner points and Koenderink corner lines. In
Section 3 we give proofs that they are unrelated to classical shape features, and
in Section 4 we give examples of robust detection of Koenderink corner points
on an artificial example and on a CT scan of a mandible.



2 Local Shape Properties 

The local properties of an analytical shape are most easily accessed through its
local differential structure. In the following we briefly review well-known
concepts from differential geometry to establish notation.

Locally (in a Monge patch) an analytical shape can be described as the
orthogonal deviation from a tangent plane. Parameterizing the tangent plane by
(x, y) with origin in the osculation point yields the local description:

z(x, y) = \frac{1}{2}\left(z_{xx} x^2 + 2 z_{xy} xy + z_{yy} y^2\right) + O((x, y)^3)

It is always possible to rotate the (x, y) coordinate system so that $z_{xy} = 0$
and $z_{xx} > z_{yy}$. Hence

z(x, y) = \frac{1}{2}\left(k_1 x^2 + k_2 y^2\right) + O((x, y)^3)

where $k_1$, $k_2$ are the local principal curvatures, corresponding to the
curvature in the first (x) and second (y) principal directions. We denote by
$(t_1, t_2)$ the two orthogonal principal directions. Moving the tangent plane
along the shape, the principal directions turn. The integral curves of the
principal directions are denoted the lines of curvature. Normally the local shape
is characterized through the Gaussian curvature $k_1 k_2$ and the mean curvature
$(k_1 + k_2)/2$.

In Table 1, definitions of some local shape features are given.
The dimension denotes the generic dimensionality of sub-manifolds having the
respective properties. A negative dimension denotes the co-dimension, i.e. the
number of free control parameters necessary for the property to generically
appear in a point on the surface for fiducial parameter values.

Elliptical and hyperbolic points appear in areas separated by lines of
parabolic points. Umbilic points generically appear as isolated points in the
elliptical regions, or in co-dimension 1 in a parabolic point. These latter points
are denoted planar points since the total second order structure vanishes in these
points. The lines of curvature form closed curves on the surface. They start and
end in umbilic points. Generically three types of umbilici exist: lemon, monstar,
and star. They respectively have 1, 2, and 3 lines of curvature passing
through the umbilic.

Table 1. Generic dimensionality of shape features.

Name            Criterion                                                      Dimension
Elliptical      $k_1 k_2 > 0$                                                   2
Hyperbolic      $k_1 k_2 < 0$                                                   2
Parabolic       $k_1 k_2 = 0$                                                   1
Umbilic         $k_1 = k_2$                                                     0
Planar          $k_1 = k_2 = 0$                                                -1
Ridge           $\partial_{t_1} k_1 = 0$ or $\partial_{t_2} k_2 = 0$            1
Sub-parabolic   $\partial_{t_2} k_1 = 0$ or $\partial_{t_1} k_2 = 0$            1
Extremal        $\partial_{t_1} k_1 = \partial_{t_2} k_2 = 0$                   0

Including the 3rd order structure in the characterization, we may define
ridges. At a ridge the principal curvature is extremal in its corresponding
principal curvature direction. These points generically form closed curves. The
curves may again be sub-divided into 8 cases according to whether it is the first
or second curvature, whether it is a maximum or a minimum, and according to the
sign of the curvature. The most interesting types of lines are those where the
positive first principal curvature is maximal or the negative second principal
curvature is minimal. These respectively form convex and concave edge (or
corner-line) candidates. They have successfully been applied to non-rigid
registration of medical images under the name of crest lines. Here the ridges
have also been denoted the extremal mesh. The points of intersection of ridges in
the first and second principal curvature are denoted extremal points and form
corner candidates. In the mathematical literature they have also been denoted
purple fly-overs, as the ridges in the first and second principal curvature have
respectively been denoted red and blue ridges.

The sub-parabolic lines are in some sense dual to ridges. They denote points
that locally look like a surface of revolution. They are defined as the set of
points where the first principal curvature is extremal in the second principal
curvature direction, or vice versa. They also correspond to the points where the
focal surface generated by an osculating circle along the lines of curvature has a
parabolic point.



3 Koenderink Corner Points 



The classical way of describing the local shape of a surface is through the
Gaussian and mean curvature. However, the local shape is much more intuitively
described in log-polar coordinates:

\theta = \operatorname{atan}\left(\frac{k_2}{k_1}\right) \in \left[-\frac{3\pi}{4}, \frac{\pi}{4}\right]

c = \log(k_1^2 + k_2^2)

The shape index $\theta$ describes the local type of shape as a continuous
parameter. It travels from a concave sphere through a concave cylinder, a
balanced saddle, and a convex cylinder to a convex sphere as the index increases
(see Figure 1). The curvedness c describes how articulated the shape is. A large
value denotes a very articulated local shape: a small sphere has a larger
curvedness than a larger sphere. Unlike the mean and the Gaussian curvature, the
shape index is invariant to scalings of the shape. A scaling only adds a constant
value to the curvedness in all points. The shape index and curvedness in this way
form a very natural description of the local shape.
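The two quantities are simple to compute from the principal curvatures. The
sketch below is illustrative only: it assumes the ordering $k_1 \ge k_2$ and
uses the two-argument arctangent so that the index covers the whole interval
$[-3\pi/4, \pi/4]$; this reading of the atan expression is an assumption, and
the function names are hypothetical. Both quantities are undefined in a planar
point ($k_1 = k_2 = 0$).

```python
import math

def shape_index(k1, k2):
    """Shape index: polar angle of (k1, k2) after ordering so that k1 >= k2."""
    k1, k2 = max(k1, k2), min(k1, k2)
    return math.atan2(k2, k1)

def curvedness(k1, k2):
    """Curvedness c = log(k1^2 + k2^2); raises ValueError in a planar point."""
    return math.log(k1 * k1 + k2 * k2)

if __name__ == "__main__":
    examples = {
        "concave sphere": (-1.0, -1.0),
        "concave cylinder": (0.0, -1.0),
        "balanced saddle": (1.0, -1.0),
        "convex cylinder": (1.0, 0.0),
        "convex sphere": (1.0, 1.0),
    }
    for name, (k1, k2) in examples.items():
        print(name, shape_index(k1, k2), curvedness(k1, k2))
    # Scaling a shape by s divides both curvatures by s: the shape index is
    # unchanged and the curvedness shifts by the constant -2*log(s).
    s = 2.0
    print(curvedness(1.0 / s, 0.5 / s) - curvedness(1.0, 0.5))
```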




Fig. 1. The Koenderink shape index $\theta$. Drawings must be interpreted such that the
outward normal is upwards.



We make the following definition:

Definition 1 (Koenderink corner point). A point of locally maximal curvedness
is a Koenderink corner point.

Likewise we denote points of minimal curvedness as Koenderink flat points. In
the following we will show some of the generic properties of Koenderink points
and only comment on the type (flat or corner) in the cases where they have
different properties. We show the generic properties by referring to the
transversality theorem: it is generic that manifolds in jet-space intersect
transversally. This means that we can show generic properties of local shapes by
showing that they intersect transversally in the space formed by the local
derivatives. Since the logarithm does not influence the extremality, we analyze
the simpler expression $k_1^2 + k_2^2$ in the following.

Theorem 1 (Morse property). Koenderink corner points are generically isolated
points on a surface.
Proof. The criteria for a point being a Koenderink corner point may be written



We can always turn the coordinate system such that $z_{xy}$ is zero. In the case
where $k_2 \neq 0$ we may rewrite the conditions as:



These two manifolds intersect transversally. They both have co-dimension 1 in
jet-space, so generically their intersection is of co-dimension 2. Since a surface
is a 2D manifold, the points are generically of dimension 0, that is, isolated
points.
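As a sketch of how the two conditions used in this proof can be written out,
assume a Monge patch with $z_x = z_y = 0$ at the point, so that to first order
$k_1^2 + k_2^2$ agrees with $z_{xx}^2 + 2 z_{xy}^2 + z_{yy}^2$. This is a
worked illustration under that assumption, not necessarily the form used by the
authors. The vanishing of the gradient then reads:

```latex
\begin{align*}
\partial_x\left(z_{xx}^2 + 2 z_{xy}^2 + z_{yy}^2\right)
  &= 2\left(z_{xx} z_{xxx} + 2 z_{xy} z_{xxy} + z_{yy} z_{xyy}\right) = 0, \\
\partial_y\left(z_{xx}^2 + 2 z_{xy}^2 + z_{yy}^2\right)
  &= 2\left(z_{xx} z_{xxy} + 2 z_{xy} z_{xyy} + z_{yy} z_{yyy}\right) = 0.
\end{align*}
% With z_{xy} = 0 and z_{yy} \neq 0, each condition can be solved for one
% third-order derivative:
\begin{align*}
z_{xyy} = -\frac{z_{xx}\, z_{xxx}}{z_{yy}}, \qquad
z_{yyy} = -\frac{z_{xx}\, z_{xxy}}{z_{yy}}.
\end{align*}
```

Each of the two equations fixes one coordinate of jet-space in terms of the
others, which is consistent with each condition having co-dimension 1.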

What remains is to analyze the case $z_{yy} = 0$. Owing to the algebraic symmetry
in $z_{xx}$ and $z_{yy}$ we obviously obtain the same result if instead
$z_{xx} \neq 0$. This leaves the case $z_{xx} = z_{yy} = z_{xy} = 0$, which
happens in co-dimension 3 in jet-space. This is a planar point, and also a
Koenderink flat point.

The planar points occur in co-dimension 1. Parabolic points occur where the
determinant of the Hessian of z is zero. This is the case on a cone in jet-space.
The planar points occur on the tip of this cone.

The singularities in Koenderink curvedness occur generically as isolated
points, and in co-dimension one they also occur in planar points. We have
already seen here the first example of how to prove the dimensionality of a
feature. The last case of a planar point shows how to prove the coincidence of
two features: we simply augment the system of equations for being a Koenderink
corner point with the defining equation of the feature in mind, solve the
equations for some variables, and finally analyze the situations where any
denominator is zero. Following this scheme we come up with the following table
of dimensionality of coincidence:



In all cases, for corner points not being planar points, this corresponds to
what one would expect from a simple analysis of dimensionality. In the special
case of a planar point, it is a matter of definition whether these points are also
ridge, extremal, and/or sub-parabolic points. The definition of all these types
refers to the principal directions. In umbilici (of which the planar points are a
special example) this coordinate system breaks down. The structure of ridges
around umbilici is very elegantly analyzed by Bruce, Porteous, and Giblin.





                 Corner   Planar
Umbilic            -2       -1
Parabolic          -1       -1
Ridge              -1
Extremal           -2
Sub-parabolic      -1
The conclusion is that the Koenderink corners are unrelated to the other
structures. In umbilici in general, the definitions of ridge, extremal, and
sub-parabolic points break down. In the following, we will analyze the structure
of the corner points on a changing shape (including topology change), and look at
the global constraints through the index theory. In turn, we are going to use the
index theory also for robustly detecting the Koenderink corner points.



3.1 Transitions 

The singularities in curvedness appear in three different types on the surface:
minima, saddles, and maxima. The extrema correspond respectively to maximally
flat and maximally curved points. Traversing through a generic saddle in
one fiducial direction, it is a minimum in curvedness, while it is a maximum in
curvedness in the orthogonal direction. The curvedness is defined as a function
value at any location of the surface. Just as a function
$\mathbb{R}^2 \to \mathbb{R}$ generically will exhibit structures like ridges,
iso-level lines, watersheds, etc., so will the curvedness defined on a shape.

In co-dimension 1, the Koenderink curvedness generically exhibits the same
structure as a general analytical function:

c(x, y) = x
c(x, y) = \pm x^2 \pm y^2
c(x, y) = x^3 + t x \pm y^2

where t is a general control parameter. The structure is defined up to a
coordinate transformation on the shape. This corresponds exactly to the
structure of general smooth functions. This does not imply that the curvedness
generically will reveal the same structure as a general function in any
co-dimension.

One interesting case is still the planar point, which occurs in co-dimension 1.
However, in a planar point, the curvedness does not change structure at all. All
the way through the transition of a convex point becoming planar and then concave,
it will be a minimum in curvedness.

Another interesting transition happening in co-dimension 1 is the change of
topology of an iso-surface in a volumetric image. A surface changes topology
by either having a new hole in the shape or by a merge with a nearby shape.
In the case of iso-surfaces of volumetric images this happens when we vary the
intensity defining the iso-intensity surface through the value of a saddle. Since
the curvedness is independent of a sign-change of the curvatures, the curvedness
of the boundary of a solid shape and the curvedness of the boundary of the
complement of the solid shape are identical. That means that the creation of a
hole and the merging of two surfaces create the same transitions. The
generic transition is that two maxima of curvedness (one on each shape) meet
in a non-differentiable surface point, from where two saddles and two maxima
appear (see Figure 2).
Fig. 2. The generic change of topology when the level of an iso-intensity surface in a
volumetric image is varied. The shape may be either the interior or the exterior of the
hourglass shape. This corresponds respectively to the splitting of a shape going from
(a) through (b) to (c) and to the creation of a hole going from (c) through (b) to (a).
M indicates a point of maximal curvedness. S indicates a saddle point in curvedness.



3.2 Global Structure 

The curvedness is a genuine function defined all over the shape,
$c : S \to \mathbb{R}$. This means that it must satisfy standard topological
constraints formalized through the index theorem. First we define the index of a
function at a point to be the number of turns of the gradient field around the
point in a sufficiently small loop. If the point is a regular point the index is
zero. If the point is a maximum or a minimum the index is plus one. If the point
is a saddle the index is minus one (see Figure 3). The index theorem reads:

Theorem 2 (Surface index). The point index I summed over all points of a
surface is related to the Euler number E of the surface by I = 2(1 - E).

The Euler number is, in intuitive terms, the number of holes in the shape.
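The point index defined above can be evaluated numerically by counting the
turns of the gradient along a small circle. The following is a minimal sketch
for a function given in a flat local chart; the function name, the loop radius,
and the sampling density are illustrative assumptions.

```python
import numpy as np

def gradient_index(grad, center=(0.0, 0.0), radius=0.1, n=720):
    """Number of turns of the gradient field along a small circle.

    `grad(x, y)` must return the two gradient components for arrays x, y.
    """
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    x = center[0] + radius * np.cos(t)
    y = center[1] + radius * np.sin(t)
    gx, gy = grad(x, y)
    angle = np.arctan2(gy, gx)
    # Angle increments around the closed loop, wrapped into [-pi, pi).
    steps = np.diff(np.append(angle, angle[0]))
    steps = (steps + np.pi) % (2.0 * np.pi) - np.pi
    return int(round(steps.sum() / (2.0 * np.pi)))

if __name__ == "__main__":
    print(gradient_index(lambda x, y: (-2 * x, -2 * y)))   # maximum of -(x^2 + y^2)
    print(gradient_index(lambda x, y: (2 * x, 2 * y)))     # minimum of x^2 + y^2
    print(gradient_index(lambda x, y: (2 * x, -2 * y)))    # saddle of x^2 - y^2
```

Summing this quantity over all singular points of the curvedness gives the
left-hand side of Theorem 2.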




Fig. 3. Solid lines indicate lines of steepest descent and dotted lines indicate iso-level
curves around an extremum (a) and a saddle (b). Following the rotation of the lines
around the singularity yields one turn in the positive direction (a) and one turn in the
negative direction (b) respectively. That is, the local surface index around an extremum
is 1, while it is -1 around a saddle.



This yields a global constraint on the number of the different types of
singularities in the curvedness on a shape. As long as the shape does not change
topology, the number of extrema minus the number of saddles is constant. Since
transitions where singularities interact are local, they must also be index
neutral, like the transition through a planar point mentioned above. However, a
change of topology also changes the total index. When a hole is created the
Euler number is increased by one, and thereby the number of saddles must increase
by two compared to the number of extrema. This is the case in the above mentioned
transition: two maxima (I = 2) are transformed into two saddles and two maxima
(I = 0) when the Euler number is increased by one (a new hole is created).
Summing the index over several shapes yields the intuitive interpretation of the
index as twice the number of shapes minus twice the number of holes. In the case
where the change of topology corresponds to a merging of two shapes, the total
index is decreased by two, corresponding to one less shape with the same total
number of holes.



3.3 Shape Edges 



So far we have looked at points of singular curvedness. Ridges have been proposed
as edge lines. In particular, the subset of ridge lines where the principal
curvature of largest absolute value is maximal in absolute value in its principal
direction have been denoted crest lines and have been used as semi-landmarks for
non-rigid registration (geometrical alignment) of medical images.

Using the curvedness for defining edges (or ridges) opens the way for an
alternative definition. Since the curvedness is a field defined over the surface,
ridges in this measure may alternatively be defined as the watersheds. The
watersheds form closed curves along steepest descent lines from Koenderink
corners towards saddles of curvedness. In each saddle two ascending and two
descending steepest descent lines start. A subset of the ascending lines form the
watersheds. The watersheds cannot be defined locally: it is a semi-global
property whether a steepest descent line ends in a saddle or not. The watersheds
are our favorite definition of shape edges for several reasons. First, they form
closed curves, whereas crest lines may end in umbilici, as the ridge through a
lemon (as an example) changes from the first to the second principal curvature.
Secondly, the edges of a shape are, according to our intuition, not locally
definable. Finally, any number of edges may generically start from a given
corner. The crest lines generically meet one, two, or three at a time in umbilici
and two or four at a time in extremal points. These are unnatural constraints on
their topology.

An alternative local definition of shape edges is similar to the definition of
crest lines: lines where the curvedness is locally maximal in a principal
direction. This definition allows for local analysis of the generic coincidence
with other shape features. As for the corner points, we may conclude that these
lines do not coincide with the above mentioned shape features. The following
table of dimensionality of coincidence summarizes:

Feature          Koenderink edge 

Umbilic               -1 
Planar                -1 
Parabolic              0 
Ridge                  0 
Extremal              -1 
Sub-parabolic          0 



4 Detection of Shape Characteristics 

In our setting, we define shapes as iso-intensity surfaces in volumetric images. 
We may find the local geometric structure of the surface by applying the implicit 
function theorem. From here it seems, at first glance, straightforward to 
find the ridges as zero crossings of an appropriate non-linear combination of 
image derivatives. However, it turns out that the appropriate expressions are not 
invariant with respect to a change of the coordinate system. This is a serious 
problem that cannot be avoided. In the following we give an intuitive 
explanation of the problem, as identified by Thirion et al. We argue that this 
problem is unavoidable, and show that for the Koenderink corner points a robust 
and consistent method exists. 

The problem of detecting ridges originates in the fact that the lines of curvature 
are not oriented. An odd-order derivative along such a line thereby has an arbitrary 
sign. The extremal mesh and the crest lines are defined in terms of a first-order 
derivative along the lines of curvature. Spurious lines will be detected wherever we 
arbitrarily change direction along the lines of curvature. A shape with a topology 
different from that of a torus cannot be globally parametrized. We see that 
immediately from the index theorem: as long as the sum of indices over the shape is 
different from zero, at least one singularity must be present. This is normally 
denoted the Hairy Ball Theorem: a ball of tangential hair must have at least one 
whirl. Complex algorithms may overcome these parameterization problems by 
looking at the stability of lines when coordinate systems are changed, or by making 
parameterizations of the surface with only one singularity, placed at a point known 
in advance not to be of interest. However, for the Koenderink corner points a much 
easier alternative exists: 

Since the Koenderink corner points appear as extrema of a field, they may 
easily be extracted by use of the index theorem. The same has been done for 
umbilici and for general singularities. The index is evaluated as an integral 
of an analytical function, and is thereby by definition a robust measure. The 
computation is well posed. In Figures 4 and 5 examples of Koenderink corners 
and extremal points are shown. For the sake of comparison, both have here 
been extracted by use of zero crossings. 

The algorithm for using the index computation may be outlined as follows: 

- compute the 3D gradient of the curvedness 
- project this onto the surface of interest 
- divide the surface into cells 
- compute the index by adding orientation differences round every cell 
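
The last step of the outline can be illustrated with a small sketch. It is a simplified planar version, not the implementation used for the figures: the projected gradient is assumed to be already available as a 2D vector field sampled on a regular grid, and the function name `cell_index` is hypothetical. For every grid cell, the orientation differences between consecutive corners are wrapped and summed; the sum divided by 2π is the index (+1 around an extremum, -1 around a saddle, 0 elsewhere).

```python
import numpy as np

def cell_index(vx, vy):
    """Index of the planar vector field (vx, vy) for every grid cell.

    vx, vy: 2D arrays sampled on the grid nodes (rows = y, columns = x).
    Returns an integer array of shape (rows - 1, cols - 1).
    """
    theta = np.arctan2(vy, vx)  # orientation of the field at every node
    # Corners of each cell, visited counter-clockwise in the (x, y) plane.
    corners = [theta[:-1, :-1], theta[:-1, 1:], theta[1:, 1:], theta[1:, :-1]]
    total = np.zeros_like(corners[0])
    for a, b in zip(corners, corners[1:] + corners[:1]):
        d = b - a
        # Wrap each orientation difference into [-pi, pi).
        d = (d + np.pi) % (2.0 * np.pi) - np.pi
        total += d
    return np.rint(total / (2.0 * np.pi)).astype(int)

# Toy fields: gradient of a function with a maximum, and with a saddle, at the origin.
x = np.linspace(-1.0, 1.0, 8)
X, Y = np.meshgrid(x, x)
index_max = cell_index(-2.0 * X, -2.0 * Y)    # gradient of -(x^2 + y^2)
index_saddle = cell_index(2.0 * X, -2.0 * Y)  # gradient of x^2 - y^2
```

Because only wrapped orientation differences are summed, the result does not depend on an orientation of the lines of curvature, which is the property the text relies on.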





Fig. 4. (a) shows the Koenderink corner lines ($\partial_{t_1} C = 0$ or $\partial_{t_2} C = 0$) and (b) the 
ridge lines ($\partial_{t_1} \kappa_1 = 0$ or $\partial_{t_2} \kappa_2 = 0$) on the same ellipsoid. This ellipsoidal shape 
is non-generic, and in this case the Koenderink corner lines and the ridge lines 
coincide. Computations are based on intersection of the surface with zero-loci of the 
product of differential expressions for each of the two types of lines respectively. 






Fig. 5. Koenderink edge lines (black) and ridges (gray) on the same mandible in three 
different projections using same algorithm as in the above figure. The lines do not 
generally coincide as this is a generic shape. However 



5 Summary 

We have seen that the Koenderink corner points are simple local geometrical 
features of a surface. They intuitively correspond well to our notion of shape 
corners. In practice, they are often very close to the extremal points. The 
evolution of Koenderink corner points under deformation of the surface is simple: 
it corresponds to what we know from general functions. They can be detected 
using the index theorem (we have not shown this in practice here, but refer to 
Sander and Zucker). We expect the extrema of Koenderink’s curvedness to 
be able to play a significant role as shape landmarks in biomedical applications. 
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Abstract. In this paper, a dynamic model for contours using wavelets is 
presented. First it is shown how to construct probabilistic shape priors for 
modeling contour deformation using wavelets. Then a dynamic model for shape 
evolution in time is presented. This allows this formulation to be applied to the 
problem of tracking a contour using the stochastic model to predict contour 
location and appearance in successive image frames. Computational results for 
two real image problems are given for the Condensation (Conditional Density 
Propagation) tracking algorithm. It is shown that this formulation successfully 
tracks the objects in the image sequences. 



1 Introduction 

Tracking is a topic of considerable interest in computer vision due to its large number 
of applications to autonomous systems [1], object grasping [2] or augmentation of 
computer’s user interface [3]. In this work we will assume that an object can be 
tracked by its shape. Therefore two important problems have to be addressed: shape 
representation and shape dynamics. 

Recently, a new multiscale technique for shape representation has been developed 
based on wavelets [4], [5] and multiwavelets [6]. There are a number of salient 
features in wavelet transforms that make wavelet-domain statistical contour 
processing attractive: 

□ Locality: Each wavelet coefficient represents the signal content localized in spatial 
location and frequency 

□ Multiresolution: The wavelet transform analyzes the signal at a nested set of scales 

□ Energy Compaction: A wavelet coefficient is large only if singularities are present 
within the support of the wavelet basis. Because most of the wavelet coefficients 
tend to be small, we need to model only a small number of coefficients. This is of 
particular importance in real time applications. 

□ Decorrelation: The wavelet transform of real world signals tends to be 
approximately decorrelated. 

Given a wavelet representation of shape, it is desirable to define dynamic 
models, since they greatly improve tracking algorithms by establishing a prior for 
possible motions. This dynamical prior distribution applies between pairs of 
successive frames. In this work we will show how to integrate probabilistic priors 
for shape deformation based on wavelets into a dynamic model that can be used for 
tracking shape in an image sequence. In particular, we will use the Condensation [7] 
tracker to present several computational results for this model. 

This paper is divided into five parts. In Section 2 the wavelet based probabilistic 
shape model is presented and its relation with Besov spaces is established. In Section 
3 we introduce the dynamic model. In Section 4 we show several tracking 
applications of this formulation in both an indoor and an outdoor scene. Finally, in 
Section 5 we present the conclusions of this work. 



2 Probabilistic Priors for Wavelet Shape Representation 

In this section, probabilistic priors for wavelet representations of shape are 
introduced and their properties are discussed, in particular a parametric 
specification for contour deformation. 



2.1 Wavelet Shape Representation 

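A wavelet basis uses translations and dilations of a scaling function $\varphi$ and a wavelet 
function $\psi$. If the translations and dilations of both functions are orthogonal, a 1-D 
function $f$ can be expressed as: 

$$f(x) = \sum_{k \in \mathbb{Z}} c_{j_0,k}\, 2^{j_0/2}\, \varphi(2^{j_0} x - k) + \sum_{j=j_0}^{\infty} \sum_{k \in \mathbb{Z}} d_{j,k}\, 2^{j/2}\, \psi(2^{j} x - k) \qquad (1)$$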

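Let then $r(s) = (x(s), y(s))$ be a discrete parametrized closed planar curve that 
represents the shape of an object of interest. If the wavelet transform is applied 
independently to each of the functions $x(s)$, $y(s)$, we can describe the planar curve in 
terms of a decomposition of $r(s)$: 

$$r(s) = \sum_{k \in \mathbb{Z}} c_{j_0,k}\, 2^{j_0/2}\, \varphi(2^{j_0} s - k) + \sum_{j \ge j_0} \sum_{k \in \mathbb{Z}} d_{j,k}\, 2^{j/2}\, \psi(2^{j} s - k) \qquad (2)$$

$$c_{j,k} = \begin{pmatrix} c_{j,k,x} \\ c_{j,k,y} \end{pmatrix}, \qquad d_{j,k} = \begin{pmatrix} d_{j,k,x} \\ d_{j,k,y} \end{pmatrix} \qquad (3)$$

where the subindices x and y indicate the coordinate function to which each 
coefficient belongs. 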
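
As an illustration of the decomposition in (2) and (3), the following sketch transforms the x and y coordinate functions of a closed curve independently. It is a minimal example under stated assumptions: the Haar wavelet is used because it is the simplest orthogonal basis (the experiments reported later use Daubechies LA(8)), the curve is an arbitrary ellipse, and the function names `haar_forward` and `haar_inverse` are hypothetical.

```python
import numpy as np

def haar_forward(signal):
    """Orthonormal Haar transform of a signal of length 2**J.

    Returns [c_00, d_0, d_1, ..., d_(J-1)], where d_j holds the 2**j
    wavelet coefficients of scale j (coarsest scale first).
    """
    approx = np.asarray(signal, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # wavelet coefficients
        approx = (even + odd) / np.sqrt(2.0)         # scaling coefficients
    return [approx] + details[::-1]

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    approx = coeffs[0]
    for d in coeffs[1:]:
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

# A closed curve r(s) = (x(s), y(s)) sampled at n = 2**6 points.
n = 64
s = 2.0 * np.pi * np.arange(n) / n
x, y = 3.0 * np.cos(s), np.sin(s)
cx, cy = haar_forward(x), haar_forward(y)  # one coefficient set per coordinate
```

The first entry of `cx` and `cy` plays the role of the scaling coefficient (proportional to the curve centroid in each coordinate); the remaining entries are the wavelet coefficients, ordered from coarse to fine.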
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2.2 Wavelet Based Probabilistic Modeling of Curve Deformation 

The simplest wavelet transform statistical models [8] are obtained by assuming that 
the coefficients are independent. Under the independence assumption, modelling 
reduces to simply specifying the marginal distribution of each wavelet coefficient. 
Wavelet coefficients are generally modeled using the generalized Gaussian 
distribution; in this work the usual Gaussian distribution will be used as an 
approximation. 

For tractability of the model, all coefficients at each scale are assumed to be 
independent and identically distributed. That is: 
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(4) 



where I denotes the identity matrix. 

Assuming an exponential decay of the variances, the final model is: 

$$d_{j,k} \sim N_2\left(\bar{d}_{j,k},\; 2^{-2j\beta} \sigma^2 I\right) \qquad (5)$$



In order to complete the definition of the model we have to specify the distribution 
for the coefficient associated with the scaling function. This coefficient is 
associated with a rigid translation of the shape, and we will assume that it is 
normally distributed and independent of the non-translation components $d_{j,k}$: 

$$c_{0,0} = \begin{pmatrix} c_{0,0,x} \\ c_{0,0,y} \end{pmatrix} \sim N_2\left(\bar{c}_{0,0},\; \sigma_c^2 I\right) \qquad (6)$$



We will use these distributions to model smooth changes of shape between frames. 
A justification for the proposed model comes from the following theorem, which is 
adapted from [8] to the one-dimensional case: 



Theorem 

Let $f(x)$ be a real function of a real variable $x$. Let it be decomposed into wavelet 
coefficients and suppose each coefficient is independently and identically distributed 
as: 

$$d_j \sim N(0, \sigma_j^2) \quad \text{with} \quad \sigma_j = 2^{-j\beta} \sigma_0 \qquad (7)$$

with $\beta > 0$ and $\sigma_0 > 0$. Then, for $0 < p, q < \infty$, the realizations of the model are almost 
surely in the Besov space $B_q^{\alpha}(L^p(I))$ if and only if $\beta > \alpha + 1/2$. 

Besov spaces are smoothness spaces: roughly speaking, the parameter $\alpha$ represents 
the number of well-behaved derivatives of $f$. 

With the above assumptions we can then define a prior probabilistic shape model 
for curve deformation as: 

$$p(X) \propto \exp\left(-\tfrac{1}{2}\, (X - \bar{X})^{T}\, \Sigma^{-1}\, (X - \bar{X})\right),$$

$$X = \left(c_{0,0,x},\, d_{0,0,x},\, \ldots,\, d_{j,k,x},\, \ldots,\, d_{J-1,2^{J-1}-1,x},\; c_{0,0,y},\, d_{0,0,y},\, \ldots,\, d_{j,k,y},\, \ldots,\, d_{J-1,2^{J-1}-1,y}\right)^{T} \qquad (8)$$

$$j = 0, \ldots, J-1, \qquad k = 0, \ldots, 2^{j}-1$$

where $n = 2^J$ is the number of points in the discretized curve, $X$ is a vector of $2n$ 
wavelet coefficients and $\Sigma$ is a diagonal matrix with the variances defined in (5), (6). 
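
The diagonal structure of the prior makes sampling straightforward. The sketch below builds the per-coefficient standard deviations for one coordinate, following the exponential decay of (5) and the separate translation variance of (6), and draws one zero-mean perturbation of the coefficient vector. The function names and the parameter values are hypothetical, and the ordering of the coefficients (scaling coefficient first, then scales from coarse to fine) is an assumption made for the example.

```python
import numpy as np

def prior_std(J, beta, sigma, sigma_c):
    """Standard deviations of the n = 2**J coefficients of one coordinate,
    ordered as (c_00, d_00, d_10, d_11, d_20, ...)."""
    stds = [sigma_c]  # scaling coefficient (rigid translation)
    for j in range(J):
        # 2**j wavelet coefficients at scale j, each with std 2**(-j*beta) * sigma
        stds.extend([sigma * 2.0 ** (-j * beta)] * 2 ** j)
    return np.array(stds)

def sample_deformation(J, beta, sigma, sigma_c, rng):
    """Draw one zero-mean coefficient perturbation for x and y, shape (2, n)."""
    std = prior_std(J, beta, sigma, sigma_c)
    return rng.standard_normal((2, std.size)) * std

rng = np.random.default_rng(0)
rough = sample_deformation(J=6, beta=0.0, sigma=1.0, sigma_c=2.0, rng=rng)
smooth = sample_deformation(J=6, beta=1.6, sigma=1.0, sigma_c=2.0, rng=rng)
```

Adding such a perturbation to the mean coefficient vector and applying the inverse wavelet transform gives one realization of the deformed curve; with a larger β the fine-scale coefficients are damped more strongly.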



2.3 Parameter Estimation 

We will show two methods for estimating the parameters of the model: the first is 
based on the maximum likelihood equations and the second on the mean square 
displacement. 



2.3.1 Maximum Likelihood Estimation (MLE) 

Given a set of samples of the wavelet descriptors $\{X_1, X_2, \ldots, X_N\}$, the equations for the 
MLE are: 
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(10) 



All parameters can be calculated easily except $\beta$, whose equation must be solved 
numerically. 



2.3.2 Estimation by Mean Square Displacement 

In case we have an estimate of $\beta$ and of the mean square deformation from the 
reference shape, we can use the following result: 
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Proposition 

Let a curve be described as in (8) with arbitrary $\Sigma$. Then the mean square displacement 
along the curve is given by: 

$$\bar{\rho}^{2} = \frac{\mathrm{Trace}(\Sigma)}{n} \qquad (11)$$

If $\Sigma$ is defined as in (8) then: 
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displacement due to translation and non-translation mean displacement pj and 
respectively. 

In the following figure we can see a reference shape (dashed line) together with 
some realizations of the probabilistic model for various values of the parameter $\beta$. As 
expected, when the parameter increases, smoother deformations arise. Around the 
figure we can see in light grey a 99% confidence interval for the points on the curve. 

Fig. 1. Realizations of the probabilistic model. Parameter values are $\beta = 0$ (no deformation 
smoothing) for the left image and $\beta = 1.6$ for the right image. 



3 Wavelet Dynamic Modeling 

In order to describe shape motion, a second order autoregressive AR(2) process in 
shape space will be used: 

$$X(t_k) - \bar{X} = A_2 \left(X(t_{k-2}) - \bar{X}\right) + A_1 \left(X(t_{k-1}) - \bar{X}\right) + w_k, \qquad w_k \sim N(0, \Sigma), \qquad (13)$$

where $\bar{X}$ represents a mean shape and $\Sigma$ the noise covariance. 
Therefore motion is decomposed as a deterministic drift plus a diffusion process 
that is assured to be smooth using the above derivations. 
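
One step of such an AR(2) process can be sketched as follows. This is an illustration only: the matrices of (13) are reduced to two scalar coefficients `a1` and `a2` (as if the whole state lay in a single subspace), the noise is taken to be independent per coefficient, and the function name `ar2_step` is hypothetical.

```python
import numpy as np

def ar2_step(x_prev, x_prev2, x_mean, a1, a2, noise_std, rng):
    """One AR(2) step around the mean shape x_mean.

    x_prev, x_prev2: coefficient vectors at times t_(k-1) and t_(k-2).
    noise_std: scalar or per-coefficient standard deviation of the noise.
    """
    drift = x_mean + a1 * (x_prev - x_mean) + a2 * (x_prev2 - x_mean)
    return drift + noise_std * rng.standard_normal(x_prev.shape)

# With a1 = 2, a2 = -1 and no noise the process extrapolates at constant velocity.
rng = np.random.default_rng(0)
track = [np.array([0.0]), np.array([1.0])]
for _ in range(3):
    track.append(ar2_step(track[-1], track[-2], np.zeros(1), 2.0, -1.0, 0.0, rng))
```

Setting `a1 = a2 = 0` instead gives independent fluctuations around the mean shape, with the smoothness of each fluctuation controlled by the per-coefficient noise standard deviations.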



3.1 Mean Displacement Parameter Determination 

In order to make the model usable it is necessary to show how to determine the 
parameters $A_1$, $A_2$, $\Sigma$. We first decompose motion into several orthogonal linear 
subspaces $P_i$, determined by their projection matrices $P_i$. Typically these subspaces are 
translation, rotation and deformation (Euclidean similarities) or translation, affine 
change and deformation (planar affine motion). 

Therefore we can write: 

= (14) 



and we will model dynamics into each subspace: 



A,=^afP„ 



R 



and use the following theorems: 



(15) 



Theorem 

Let contour dynamics be given by (13) and (15) and suppose that a steady state 
distribution exists. Then the distribution is normal with mean $\bar{X}$ and its covariance 
matrix $C_\infty$ verifies: 

$$\mathrm{Trace}(C_\infty) = \sum_i \mathrm{Trace}(C_{\infty,i}) \qquad (16)$$

where $C_{\infty,i}$ is the covariance of the steady state in subspace $P_i$. 



Theorem 

Let contour dynamics be given by (13) and (15) and suppose that a steady state 
distribution exists into subspace P,. Then its distribution is normal with mean P. X 
and its covariance matrix C^. verifies: 



Trace(C„,) 
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(17) 



Corollary 

Let contour dynamics be given by (13) and (15) and suppose that a steady state 
distribution exists into subspace P|. To obtain a mean displacement p, we must set 
to: 
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This leads us to determine the parameters associated with the random noise if we can 
estimate the deterministic motion and the mean square displacement of shape. 

In case no steady-state distribution exists we can use the following theorem: 



Theorem 

Let contour dynamics be given by (13) and (15) and suppose that no steady state 
distribution exists into subspace P|. Then the mean displacement ]5j (k) into subspace 
Pi at time verifies: 



Pi(k) ~ ^^Trace(Pil) bj 



(19) 



3.2 Maximum Likelihood Parameter Determination 

Parameters can also be estimated by maximum likelihood leading to a set of equations 
similar to (9), (10) 
As in (9), all equations are easily computed except the equation in $\beta$, which has to 
be solved numerically. 
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4 Applications to Contour Tracking 

A set of experiments has been carried out to test the validity of the approach in 
both an indoor and an outdoor scene. To track shape over different frames the 
Condensation algorithm has been employed. It uses a set of samples S from the pdf of 
the contour distribution. The algorithm then uses the dynamic model to propagate the 
pdf over time using the samples, obtaining a prediction of shape appearance and 
position. Using measurements in the image, the pdf tends to peak in the vicinity of 
observations, leading to distributions for shape at successive time steps. 

The number of wavelet coefficients used is 16, the wavelet function is 
Daubechies LA(8), and the number of elements in the set S is 250. 
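
The sample-based propagation described above can be sketched generically. The following is a minimal illustration of one Condensation iteration (resample, predict with the dynamic model, reweight with the measurements), not the tracker used in the experiments: the function name `condensation_step`, the toy one-dimensional state, the random-walk prediction and the Gaussian likelihood are all assumptions made for the example.

```python
import numpy as np

def condensation_step(samples, weights, predict, likelihood, rng):
    """One Condensation iteration on a weighted sample set.

    samples: array of shape (N, d); weights: N non-negative values summing to 1.
    predict(state, rng) applies the stochastic dynamic model to one state.
    likelihood(state) returns the (unnormalized) observation density.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    picked = rng.choice(n, size=n, p=weights)  # resample according to the weights
    predicted = np.array([predict(samples[i], rng) for i in picked])
    new_weights = np.array([likelihood(s) for s in predicted], dtype=float)
    return predicted, new_weights / new_weights.sum()

# Toy run: 250 samples, a random-walk model and an observation near 1.0.
rng = np.random.default_rng(0)
S = rng.standard_normal((250, 1))
w = np.full(250, 1.0 / 250)
predict = lambda s, rng: s + 0.1 * rng.standard_normal(s.shape)
likelihood = lambda s: float(np.exp(-0.5 * ((s[0] - 1.0) / 0.2) ** 2))
for _ in range(5):
    S, w = condensation_step(S, w, predict, likelihood, rng)
estimate = float((w[:, None] * S).sum())  # weighted mean of the sample set
```

In the tracking setting each state would be a vector of wavelet coefficients, `predict` would be the AR(2) model, and `likelihood` would be computed from image measurements along the predicted contour.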

In the first example (Fig. 2) motion is modelled as translation plus deformation with: 

Translation subspace $P_1$: $a_1 = 2$, $a_2 = -1$, $\rho = 3$ (no steady state) 

Deformation subspace: $a_1 = 0$, $a_2 = 0$, $\rho = 0.5$, $\beta = 2.25$ 




Fig. 2. An indoor scene. Frames are numbered 1-6 from top to bottom and left to right 



To visualize the pdf, a set of 20 curves has been sampled from the contour 
distribution in all frames. In frame 5 we can see how a false matching appears (the 
distribution becomes multimodal); however, the algorithm recovers as the hand goes 
on moving, as can be seen in frame 6. 

In the second case (Fig. 3) the background is cluttered, with changes between light 
and shadow, and there are several moving elements interacting with the person being 
tracked. Parameters in this case have been: 

Translation subspace $P_1$: $a_1 = 2$, $a_2 = -1$, $\rho = 1$ (no steady state) 

Deformation subspace: $a_1 = 0$, $a_2 = 0$, $\rho = 0.1$ 

and the smoothness parameter for frame-to-frame deformation has been $\beta = 2.25$. 

As we can see, the head of the girl is in general successfully tracked. In frame 5 the 
dynamic model fails because the girl suddenly stops, leading to the curves being more 
dispersed around the head. However, the Condensation algorithm recovers as the girl 
goes on moving, as can be seen in frame 6. 




Fig. 3. An outdoor scene. Frames are numbered 1-6 from top to bottom and left to right 



5 Conclusions 

In this work, a dynamic autoregressive model for shape change in wavelet space has 
been presented. It is shown how to use this formulation to predict contour location 
and appearance in successive image frames. These components are integrated in the 
Condensation tracking algorithm and computational results show that this formulation 
successfully tracks the objects in the image sequences. 
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Abstract. A method for partitioning shapes is described based on a 
global convexity measure. Its advantages are that its global nature makes 
it robust to noise, and apart from the number of partitioning cuts no 
parameters are required. In order to ensure that the method operates 
correctly on bent or undulating shapes a process is developed that iden- 
tifies the underlying bending and removes it, straightening out the shape. 
Results are shown on a large range of shapes. 



1 Introduction 

Shape obviously plays an important role in biological vision. However, the task 
of shape perception is inherently complex, as demonstrated by the slow devel- 
opmental process of learning undertaken by children to recognise and use shape. 
At first, they can only make topological discriminations. This is then followed 
by rectilinear versus curvilinear distinctions, then later by angle and dimension 
discrimination, and then continuing to more complex forms, etc. m- 

The application of shape in computer vision has been limited to date by the 
difficulties in its computation. For instance, in the field of content based image 
retrieval, simple methods based on global colour distributions have been reasonably 
effective. However, although attempts have been made to incorporate 
shape, they are still relatively crude. 

One approach is to simplify the problem of analysing a shape by breaking it 
into several simpler shapes. Of course, there is a difficulty in that the process of 
partitioning will require some shape analysis. However, to avoid a chicken and 
egg problem, low level rules based on limited aspects of shape understanding can 
be used for the segmentation. 

There are numerous partitioning algorithms in the computer vision literature. 
Many are based on locating significant concavities in the boundary. It has 
been shown that humans also partition forms based on, or at least incorporating, 
such information. While this lends credence to such an approach, the 
computational algorithms employed to detect the concavities generally depend 
on measuring the curvature of the shape’s boundary. Unfortunately curvature 
estimates are sensitive to noise. Although the noise can be reduced or eliminated 
by smoothing, for instance, it is not straightforward to determine the appropriate 
degree of filtering. In addition, purely local boundary-based measures do not 
capture the important, more global aspects of shape. An alternative approach 
that can incorporate more global information operates by analysing the skeleton 
of the shape. Nevertheless, it remains sensitive to local detail and tends 
to be error prone since even small amounts of noise can introduce substantial 
variations into the skeleton. Thus substantial post-processing generally needs to 
be applied in an attempt to correct the fragmentation of the skeleton. 

2 Convexity-Based Partitioning 

To overcome the unreliability of the curvature and skeleton based methods, an 
approach to partitioning was developed based on global aspects of the shape [12]. 
The criterion for segmentation was convexity. Although convexity had been used 
in the past, previous algorithms still required various parameters to tune their 
performance, generally to perform the appropriate amount of noise suppression. 
In contrast, the only parameter in Rosin’s formulation was the number 
of required subparts. Moreover, an approach for automatically determining 
this number was also suggested. 

The convexity of a shape was measured as the ratio of its area to the 
area of its convex hull. The total convexity of a partitioned shape was defined 
as the sum of the individual convexity values of the subparts, each weighted by 
their area relative to the overall shape’s area. Thus the convexity and combined 
subpart convexity values range from zero to one. Partitioning was performed 
by selecting the decomposition maximising convexity. As with most partitioning 
schemes, straight lines were used to cut the shape into subparts, and the cuts 
were constrained to lie within the shape. 

The convexity measure is robust since small perturbations of the shape 
boundary only result in small variations in the convex hull. Thus noise has a 
minor effect on the areas of the shape and its convex hull, and therefore on the 
convexity measure itself. Although the measure is based on global properties 
of the shape it produces good partitions, often locating the cuts at significant 
curvature extrema even though no curvature computation is necessary. 
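
The convexity measure and the area-weighted total for a partitioned shape can be written down in a few lines. The sketch below is an illustration for simple polygons given as lists of (x, y) tuples; the function names are hypothetical, the convex hull is computed with the standard monotone chain method, and the L-shaped test polygon is an arbitrary example.

```python
def polygon_area(pts):
    """Area of a simple polygon (shoelace formula)."""
    twice_area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0

def convex_hull(pts):
    """Convex hull by the monotone chain method, counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convexity(pts):
    """Ratio of the polygon area to the area of its convex hull."""
    return polygon_area(pts) / polygon_area(convex_hull(pts))

def total_convexity(parts):
    """Sum of subpart convexities, each weighted by its share of the total area."""
    total_area = sum(polygon_area(p) for p in parts)
    return sum(polygon_area(p) / total_area * convexity(p) for p in parts)

l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
parts = [[(0, 0), (2, 0), (2, 1), (0, 1)], [(0, 1), (1, 1), (1, 2), (0, 2)]]
```

For the L-shape the single cut along y = 1 raises the score from 3/3.5 to 1, which is the kind of improvement the partitioning scheme searches for over all candidate cuts.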



3 Shape Straightening 

Despite its general success, there are also instances in which the convexity based 
scheme fails [12]. In figure 1 the effect of the ideal cut (shown dotted) would 
be to split the crab shape into the inner convex part and the outer non-convex 
part. The latter would score very poorly according to convexity, and so the crab 
would actually receive a better score without performing any partitioning. This 
is counterintuitive since we would expect that partitioning should always lead to 
simplification. In general, we make the observation that many bent objects will 
be given a low convexity rating even though human perception might suggest 
that they are suitable representations of simple, single parts. 




Fig. 1. Convexity is not always appropriate for partitioning as demonstrated in the crab 
figure. Instead of the ideal cut (shown dotted) the gray cut is selected. 



In this section we describe a solution to overcome the deficiency of convexity 
in this context. Since it is the bending of the shape that creates the problem, the 
bending is removed. Conceptually, if the outer portion of the crab were straight- 
ened it would receive a high convexity score since there would be no concavities. 
Of course, the straightening process should not eliminate all concavities per se 
since these are required to enable the convexity measure to discriminate good 
and bad parts. Instead, the most basic underlying bending should be removed 
while leaving any further boundary details unchanged. 

Some work in computer vision and computer graphics has looked at multi-scale 
analysis and editing of shapes. For instance, Rosin and Venkatesh 
smoothed the Fourier descriptors derived from a curve in order to find “natural” 
scales. The underlying shape was then completely removed by modifying 
the lower descriptors such that on reconstruction just the fine detail occurring 
at higher natural scales was retained and superimposed onto a circle. Another 
approach was taken by Finkelstein and Salesin, who performed wavelet 
decompositions of curves and then replaced the lower scale wavelets extracted from 
one curve with those of another. Although both methods enabled the high resolution 
detail to be kept while the underlying shape was modified, there were several 
limitations. The Fourier based approach operates globally, and therefore assumes 
uniform detail spatially distributed over the curve, which is not necessarily cor- 
rect. Wavelets have the advantage that they cope with spatial localisation, but 
Finkelstein and Salesin did not provide any means for automatically selecting 
which wavelets to retain such that they correspond to significant curve features. 

The approach taken in this paper is to determine the appropriate straightening 
of a shape by first finding its medial axis. Its sensitivity to noise can be 
overcome since the axis is only required to describe the shape at a very coarse 
level, and so heavy smoothing can be applied to eliminate all branches, as shown 
in figure 2. More precisely, the boundary is repeatedly smoothed, and at each 
step the branches in the resulting medial axis are identified by checking for 
vertex pixels. If no vertex pixels are found the smoothing terminates and the final 
boundary and axis are returned. Our current implementation uses Gaussian blurring 
of the boundary and Zhang and Suen’s thinning algorithm to extract the medial 
axis; vertices are identified by checking at each axis pixel for three or 
more black/white or white/black transitions while scanning in rotation around 
its eight neighbours. 
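
The vertex test can be sketched as follows. This is an illustrative version under stated assumptions: the axis is a binary image with axis pixels equal to 1, transitions of a single polarity (background to axis) are counted during the scan, border pixels are skipped, and the names `crossing_number` and `has_vertex` are hypothetical. The smoothing and thinning steps are not shown.

```python
import numpy as np

# The eight neighbours in rotation order, as (row, column) offsets.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def crossing_number(img, r, c):
    """Number of background-to-axis transitions met while scanning once
    around the eight neighbours of pixel (r, c)."""
    ring = [int(img[r + dr, c + dc]) for dr, dc in RING]
    return sum(1 for a, b in zip(ring, ring[1:] + ring[:1]) if a == 0 and b == 1)

def has_vertex(img):
    """True if some interior axis pixel has three or more transitions,
    i.e. the axis still contains a branch point."""
    rows, cols = img.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if img[r, c] and crossing_number(img, r, c) >= 3:
                return True
    return False

# An unbranched axis and a T-shaped axis with one branch point.
line = np.zeros((8, 8), dtype=np.uint8)
line[3, 1:6] = 1
tee = line.copy()
tee[3:7, 3] = 1
```

In the iterative scheme, the boundary would be smoothed further for as long as `has_vertex` reports a branch point in the thinned shape.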



Fig. 2. Repeated boundary smoothing applied nntil all medial axis branches are elim- 
inated. 

Once the axis is found it is used to straighten the shape. First, each boundary 
point needs to be assigned to a point on the axis. Conceptually this can be 
performed by regenerating the shape by growing the axis. In practice we just run 
a distance transform taking the axis as the feature set. In addition to 
propagating the distances, the originating co-ordinates of the closest feature are 
also propagated. These then provide the corresponding axis points for each boundary 
point. At this stage the smoothed boundary points are still used (after integer 
quantisation) rather than the original boundary set. 

Second, a local co-ordinate frame is determined for each boundary point. The 
frame is centred at the corresponding axis point and orientated to align with the 
local section of axis. The orientation is calculated by fitting a straight line to 
the ten axis pixels on either side of the centre. The position of each point in the 
original shape boundary is now represented in polar co-ordinates with respect 
to the local co-ordinate frame determined for its corresponding smoothed point. 

The third step performs the straightening of the boundary by first straightening 
the medial axis. The axis points $(x_i, y_i)_{i=1,\ldots,n}$ are mapped to $(i, 0)$, giving 
the straight line $(0, 0) \to (n, 0)$. Transforming the local co-ordinate frames to be 
appropriately centred and oriented, the transformed boundary points have now 
been straightened. 
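
The three steps can be condensed into a short sketch. It deliberately simplifies the procedure described above and should be read as an assumption-laden illustration: the nearest axis point is found by brute force instead of a distance transform, the local axis direction is taken from the chord between the ends of the window instead of a fitted line, each boundary point is assigned directly (without the intermediate smoothed boundary), and Cartesian offsets in the local frame are used in place of polar co-ordinates. The function name `straighten` is hypothetical.

```python
import numpy as np

def straighten(boundary, axis, half_window=10):
    """Map boundary points into the frame of a straightened axis.

    boundary: (m, 2) array of boundary points.
    axis: (n, 2) array of ordered axis points with roughly unit spacing.
    Axis point i is mapped to (i, 0); each boundary point keeps its offset
    relative to the local frame of its nearest axis point.
    """
    boundary = np.asarray(boundary, dtype=float)
    axis = np.asarray(axis, dtype=float)
    n = len(axis)
    out = np.empty_like(boundary)
    for m, p in enumerate(boundary):
        i = int(np.argmin(np.sum((axis - p) ** 2, axis=1)))  # nearest axis point
        lo, hi = max(i - half_window, 0), min(i + half_window, n - 1)
        tangent = axis[hi] - axis[lo]                         # local axis direction
        tangent = tangent / np.linalg.norm(tangent)
        normal = np.array([-tangent[1], tangent[0]])
        offset = p - axis[i]
        out[m] = (i + offset @ tangent, offset @ normal)
    return out

# A diagonal axis with unit spacing and two points at known offsets from it.
axis = np.arange(21)[:, None] * np.array([1.0, 1.0]) / np.sqrt(2.0)
normal = np.array([-1.0, 1.0]) / np.sqrt(2.0)
boundary = np.array([axis[5] + 2.0 * normal, axis[12] - 1.0 * normal])
straight = straighten(boundary, axis)
```

For this straight diagonal axis the result is simply a rotation: the two points land at (5, 2) and (12, -1), i.e. at their axis index and signed distance from the axis.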

An example of the full process is given in figure 3. The irregular map of 
Africa is smoothed until its medial axis represents just the underlying bent 
shape. The distance transform of the medial axis is shown in figure 3c, where 
low intensities represent small distances. The final, straightened map of Africa in 
figure 3d clearly demonstrates that the original major bend has been removed 



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Fig. 3. The straightening process, (a) The irregular outline of the input shape produces 
a skeleton with several branches, (b) The shape is iteratively smoothed until its skeleton 
consists of a single spine, (c) The distance transform of the skeleton is generated, and 
the X and Y co-ordinates of the closest axis point are recorded. This enables the 
appropriate transformation to be applied to the original shape, resulting in (d). 



while the local boundary features have been retained, although slightly distorted 
in some instances. 




Fig. 4. Examples of shape straightening. The first column contains the original shape; 
the second column contains the smoothed shape with the medial axis; and the third 
column contains the straightened shape. 



The validity of the approach is demonstrated on the synthetic examples in 
figure 4. Due to the nature of the data the medial axis is easily found, and reliably 
represents the basic form of the underlying shape. The final results show that 
the straightening is performed correctly. 

Further examples of shape straightening are provided in figure 5. For simple 
elongated shapes the technique is generally successful as the medial axis is 
representative of the bending underlying the shape. Cases in which there are 
several competing axes are more problematic. For instance, in the donkey there 
is a dominant elongated horizontal portion as well as three vertical elongated 
portions (the donkey’s fore feet and rear feet, and the rider). No single 
unbranching axis can capture all this. Nevertheless, the result successfully 
straightens the donkey’s rear feet and tail even though the rider and fore feet 
remain protruding. A similar partial straightening is seen on the elephant. In some 
cases local distortions are evident. Such errors creep in from a combination of 
sources such as the distance transform approximation, fitting of the local 
co-ordinate frame, and the mapping itself. 

Fig. 5. Examples of performing shape straightening of natural data 

4 Partitioning 

The partitioning algorithm is now complete. Its operation is much as before: 
candidate cuts are assessed and the one maximising the weighted sum of con- 
vexities is selected. However, before calculating convexity the subpart is first 
straightened out. Since this transformation can distort the size as well as shape 
of the subpart the total convexity of the set of subparts is combined using the 
individual subpart convexities weighted according to the relative area of the 
unstraightened subparts. 

The results of the two algorithms are compared in the following figures, in 
which the different levels of performance have been grouped. It can be seen that 
in many cases both methods produce the same or very similar results (figures 6 
and 7). Sometimes the original algorithm is still successful despite the shape 
containing significant bending. For instance, although the fish’s tail wiggles back 
and forth it is thin. Thus its low area causes its low convexity to only contribute 
weakly to the overall high convexity generated mainly by the highly convex fish 
body. The results in figure 7 verify that the incorporation of straightening does 
not prevent the new algorithm from performing satisfactorily. 




Fig. 7. Similar partitioning using convexity in combination with straightening 



Figures 8 and 9 contain results that differ significantly between the 
algorithms, although it is not clear that either one is superior. For instance, the 
original algorithm has cut off one of the curved arms in the second shape. By 
incorporating straightening the new algorithm has managed to successfully 
combine both arms into a single part. 

Fig. 8. Different partitioning using convexity alone 



Fig. 9. Different partitioning using convexity in combination with straightening 



In some cases we find that the addition of straightening worsens the 
effectiveness of the method (see figures 10 and 11). The head is better partitioned 
by the old algorithm, although the new algorithm's result is still reasonable. By 
making a cut from the nose to the back of the head it has created a region that was 
straightened into a fairly convex shape. On the last shape the new algorithm's 
result is poor, although a contributing factor is that it needs to be partitioned 
into more than two subparts. 




Fig. 10. Better partitioning using convexity alone 




Fig. 11. Worse partitioning using convexity in combination with straightening 



Finally, examples in which straightening has provided a clear benefit are 
given in figures 12 and 13. In most cases the failings of using convexity alone 
are self-evident: sections are chopped off with no regard for their fitness as 
subparts. 



Fig. 13. Better partitioning using convexity in combination with straightening 



5 Discussion 

In this paper we have shown how shapes can be straightened, and how this can 
be applied to aid partitioning. Several issues remain, relating to the efficiency 
and effectiveness of the technique. 

Currently, the straightening process is time consuming and can take several 
seconds. Since it is applied repeatedly as part of the evaluation of many candidate 
cuts, this slows down the overall analysis of a shape containing a thousand points 
to several hours. The actual time depends on the shape: if it contains many 
concavities, as the spiral does, then many of the trial cuts will lie outside the 
shape and can therefore be rejected without requiring the more time consuming 
straightening and convexity calculations. 

Two approaches to speeding up the process are possible. The first is to 
improve the efficiency of the straightening process. The current implementation 
involves some image based operations (for the medial axis calculation and axis 
branch checking). A significant improvement could be made by determining the 
medial axis directly from the shape boundary. Efficient algorithms exist; in 
particular, Chin et al. recently described an algorithm that runs in linear time 
(with respect to the number of polygon vertices). 

Another complementary approach is to apply the convexity calculation only 
at selected cuts. Rather than exhaustively considering all possible pairwise com- 
binations of boundary points as potential cuts, a two stage process can be em- 
ployed. For example, the cuts can be restricted to include only a subset of bound- 
ary points such as dominant (i.e. corner) points. Although corner detectors are 
typically unreliable, if a low threshold is used then the significant points will 
probably be detected, at the cost of also including additional spurious points. 
Alternatively, a simpler but less reliable partitioning algorithm can be used to 
produce a set of candidate cuts by running it over a set of parameter values. 
These can then be evaluated and ranked by convexity. 

At the moment the only speed-up implemented is to process the data at 
multiple scales. First the curve is subsampled, typically at every fourth point. 
The best cut is determined and this initialises a second run at full resolution in 
which only cuts around the best low resolution cut are considered. Currently the 
window centred at the first cut is six times the sampling rate. 
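
The two-pass search can be sketched as below. This is an illustrative outline under stated assumptions: `score` is a hypothetical callable returning the quality of the cut between boundary points `i` and `j` (for example the weighted convexity), and the window is interpreted as a span of `step * window_factor` points centred on each endpoint of the coarse cut.

```python
from itertools import combinations
from typing import Callable, Tuple


def coarse_to_fine_cut(
    n_points: int,
    score: Callable[[int, int], float],
    step: int = 4,
    window_factor: int = 6,
) -> Tuple[int, int]:
    """Find a high-scoring cut (i, j) on a closed boundary of n_points."""
    # Pass 1: exhaustive search over the subsampled boundary only.
    coarse = range(0, n_points, step)
    i0, j0 = max(combinations(coarse, 2), key=lambda c: score(*c))

    # Pass 2: full resolution, restricted to a window around the coarse cut.
    half = (step * window_factor) // 2
    cand_i = [(i0 + d) % n_points for d in range(-half, half + 1)]
    cand_j = [(j0 + d) % n_points for d in range(-half, half + 1)]
    pairs = [(i, j) for i in cand_i for j in cand_j if i != j]
    return max(pairs, key=lambda c: score(*c))


# Toy score with a single peak at the cut (13, 57) on a 100-point boundary.
def toy_score(i: int, j: int) -> float:
    return -float((i - 13) ** 2 + (j - 57) ** 2)


best = coarse_to_fine_cut(100, toy_score)
```

The coarse pass evaluates roughly one sixteenth of the point pairs, and the fine pass a fixed number that does not grow with the boundary length, which is where the saving comes from.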

Regarding the effectiveness of the straightened convexity measure some im- 
provements could be made. As discussed previously, the measure does not ex- 
plicitly take curvature extrema into account. Nevertheless these are important 
local features even though their reliable detection is problematic. 

On the issue of the saliency of alternative partitionings, Hoffman and 
Singh ran psychophysical experiments to determine three factors affecting 
part salience: relative area, amount of protrusion, and normalised curvature 
across the part boundary. Previously the convexity measure was demonstrated 
to match these human saliency judgements fairly well on a simple parameterised 
shape. However, there remain examples in which the basic convexity 
and the straightened convexity measures cannot discriminate between alternative 
partitions of different quality. For instance, most humans would judge the first 
segmentation in figure 14 as more intuitive than the second, but after 
straightening both receive perfect convexity scores. 




Fig. 14. Alternative partitions with identical straightened convexity ratings 



Finally, the straightening process works well for elongated, bent shapes, but 
can run into problems with shapes containing several competing dominant axes. 
Simplifying the axes in order to remove all the vertices requires a large amount 
of smoothing, leading to distortion of the shape. 



Abstract. In this paper we propose to use invariant signatures of 
polygonal approximations of smooth curves for projective object recognition. 
The proposed algorithm is not sensitive to the curve sampling scheme or 
density, due to a novel re-sampling scheme for arbitrary polygonal 
approximations of smooth curves. The proposed re-sampling provides for 
weak-affine invariant parameterization and signature. Curve templates 
characterized by a scale space of these weak-affine invariant signatures 
together with a metric based on a modified Dynamic Programming 
algorithm can accommodate projective invariant object recognition. 



1 Introduction 



An invariant signature of a planar curve is a unique description of that curve 
that is invariant under a group of viewing transformations. Namely, all curves 
which are transformations of each other have the same signature, whereas the 
signatures of all other curves are different. Invariant signatures may be used to 
index or to detect symmetries of planar curves under viewing transformations. 

Invariant signatures of planar curves are the subject of many research papers. 
Generally speaking, in order to describe planar curves under a group of 
transformations, one has to employ two independent local descriptors, which are 
invariant under the required group of transformations. Namely, two numbers, 
which are well defined on small curve segments and change whenever the curve 
is changed, unless the change is a transformation from the required group. 

When the two descriptors have a differential formulation, the curve is a so- 
lution of the differential equations and the required initial conditions. Two in- 
dependent descriptors are needed since each limits the locus of the ’next curve 
point’ to a one dimensional manifold (the ’next point’ is the intersection of these 
manifolds) . 

Consider the above mentioned boundary condition. Since the description is 
invariant under a group of transformations, one should be able to reconstruct all 
the instances of that transformation from the same signature (though necessarily 
from different boundary conditions). In order to accommodate the dimensionality 
of the transformation group¹ we necessarily need more boundary conditions 
for more complicated groups. 

It is indeed a known fact that differential signatures of more complicated 
groups of transformations have higher degrees of derivatives. This, in turn, 
causes numerical problems when one tries to implement the theory for 
complicated groups. 

An accepted solution to the high derivative problem is to trade derivatives for 
features which are integral in nature, and thus more stable. The integral features 
are local applications of global geometric invariants. They are sometimes 
called semi-differential invariants, and in other cases they are formulated 
as stable numerical schemes for differential invariants. For example, setting 
a grid of equally spaced points on the curve is the integral equivalent to the 
Euclidean arclength, described differentially² as $ds_E = \sqrt{X_p^2 + Y_p^2}\,dp$. 
Alternatively, setting a grid such that the area enclosed between edges and curve 
segments is equal is the integral equivalent to the weak-affine invariant arclength 
$ds_a = \sqrt[3]{X_p Y_{pp} - Y_p X_{pp}}\,dp$. 

Invariant grids imply polygonal curve sampling, though only a few go all 
the way to analyzing polygonal signatures, and specifically polygonal signatures 
of smooth curves. In this context we address two problems: 

– Polygonal Signatures: The simplest parametric curve description is the 
polygon. Polygonal signatures have been addressed before; however, the 
following practical issues have not been addressed. Polygons are often 
approximations of smooth curves, rather than intrinsically polygonal shapes. 
One cannot assume invariant curve sampling. Furthermore, sampling is not 
necessarily stationary, i.e. the sampling density may differ between polygonal 
approximations, and possibly even within the same polygonal approximation. 

– Complex Transformation Groups: Both differential and non-differential 
descriptors for transformation groups that are more complicated than the 
weak-affine group are complex and unstable, with the notable exception of 
the robust methods by Weiss, who fits canonical curves to discrete curve 
segments. This method cannot, however, be implemented for sparsely sampled 
curves, i.e. polygons. Another effort, to approximate complex invariants 
by simpler ones, is more in the spirit of this paper, although we do 
not approximate complex invariants, but rather accommodate the difference 
between the required invariance and the available (simpler) signature by way 
of a tailored metric. 

In this paper we propose to solve these problems. Specifically, we propose 
a re-sampling method for polygonal shapes that is invariant under weak affine 
transformations. Namely, two polygonal versions of the same curve, or its weak 
affine transform, with arbitrarily different sampling schemes can be re-sampled 
so that the new sampling density on both curves is similar, see Figure 1. 
Consequently the polygonal signatures of the re-sampled polygons are similar. In 
addition we propose a method to use our weak affine signature to index or 
detect symmetries in curves under affine and projective transformations. 

¹ Degrees of freedom, for example: The weak-affine group has 5 degrees of freedom (2 
for translation, 1 rotation, 1 aspect ratio, and 1 skew), the affine group has 6 (5 as 
the weak-affine, plus scale), and the projective group has 8 (6 as the affine group, 
plus 2 for tilt). 

² We use the following notations: $X_p = dX/dp$, $X_{pp} = d^2X/dp^2$. 





Fig. 1. Polygonal up-sampling of arbitrarily pre-sampled curves. 



In the next section we describe the proposed polygonal re-sampling. In 
Section 3 we present the second invariant descriptor, used for the signature value. 
In Section 4 we propose a metric for matching two signatures. Section 5 is a brief 
summary of this paper. 

2 Re-sampling 

Calabi et al. have proposed a sampling method for smooth curves that 
converges to the weak affine arclength. In this section we present a method to 
sample polygonal approximations of curves in a manner that is consistent with 
the weak affine arclength. 

It has already been suggested to sample curves in a manner invariant to 
weak affine transformations by setting sampling points so that the area enclosed 
between the curve and line segments connecting them is constant. However, this 
and other area measures proposed originally for smooth curves do not generalize 
well to polygons, particularly not for up-sampling (i.e. adding vertices). 

The solution proposed in this paper is based on the fact that area is invariant 
to weak affine transformations. Specifically we propose the following scheme, 
applicable to polygons as well as smooth shapes, see Figure 2. Given a point S 
on the curve, the next point S' is found by: 

1. Determining an anchor point A based on an enclosing area of predefined size 
e > 0. 

2. Determining S' such that the line segment based on A and sweeping the 
curve starting at S covers an area of a · e for some a ∈ [0, 1]. 

Note that both A and S' are, by definition, invariant under weak affine 
transformations. 

Fig. 2. Polygonal re-sampling invariant to weak affine transformations. 



When e → 0 the resulting sub-sample density is proportional to the 
weak-affine arclength. Nevertheless, one can use any e > 0 and still keep the 
invariance to weak affine transformations. 

The parameter e is, in a certain sense, a scale parameter, filtering out small 
curve perturbations, so that perturbations as in Figure 2b are filtered out. Since 
our main interest is polygonal shapes, it should be noted that polygons have 
intrinsic artifacts (the vertices), whose influence we need to filter out. A rule of 
thumb to select e is therefore to make it large enough, so that any boundary 
segment delimiting an area of e contains at least two vertices (note that for 
any e > 0 delimiting boundary segments contain at least one vertex). Since 
the re-sampling needs to be invariant to a wide range of original polygonal 
approximations, we need to determine e considering the worst expected sampling 
density in a given application. Let us note again that e can be arbitrarily large 
and still be invariant to weak affine transformations. If for a large e one still 
needs a dense sub-sampling, one can always resort to a small a. 



3 Signature Value 

The invariant grid described in the previous section is used as the first invariant 
descriptor (arclength). In this section we describe the proposed second invari- 
ant, the signature value. Like the first invariant used for invariant arclength, the 
signature value is based on a local application of global geometric invariants. 
Specifically, we advance an invariant curve segment on the curve in both direc- 
tions, and define the signature value to be the area of the triangle defined by the 
current, forward, and backward points. 

It is not recommended to use the invariant arclength proposed in the 
previous section as a measure for the forward/backward advance unless the invariant 
distance advanced to either side is larger than 1/a. Otherwise, polygon 
artifacts might influence the signature value (e.g. all the invariant points might be 
located on the same polygon edge). Therefore, the proposed signature uses the 
anchor point described in the previous section as the forward invariant point. 
The backward point is obtained symmetrically by enclosing an e area in the 
opposite direction. The signature value is the area ratio between the triangle area 
and e. Note that: 

– The proposed signature value does not approximate the invariant curvature. 

– For e → 0 on smooth curves the absolute value of the signature converges to 
6; however, since in our implementations e is constrained from below, this is 
not the case with the proposed signature. 

– Like the invariant arclength, the proposed signature value is not fully affine 
invariant because of the need to determine e. 
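
The invariance behaviour noted above can be checked numerically with a small sketch. It assumes the backward, current, and forward points have already been found (the anchor-point search itself is not shown), and all function names are illustrative. The three sample points lie on the parabola y = x²/2, for which the area between the chord from (0, 0) to (1, 0.5) and the curve is exactly 1/12.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Matrix = Tuple[Tuple[float, float], Tuple[float, float]]


def signed_triangle_area(p: Point, q: Point, r: Point) -> float:
    """Signed area of triangle (p, q, r); positive if counter-clockwise."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))


def signature_value(backward: Point, current: Point, forward: Point, e: float) -> float:
    """Triangle area divided by the enclosing-area parameter e."""
    return signed_triangle_area(backward, current, forward) / e


def affine_map(pt: Point, a: Matrix, v: Sequence[float]) -> Point:
    """Apply X -> A X + V to a single point."""
    x, y = pt
    return (a[0][0] * x + a[0][1] * y + v[0], a[1][0] * x + a[1][1] * y + v[1])


pts = [(-1.0, 0.5), (0.0, 0.0), (1.0, 0.5)]  # backward, current, forward
e = 1.0 / 12.0
base = signature_value(*pts, e)

# Weak affine map (determinant 1): the value is unchanged for the same e.
weak = ((2.0, 1.0), (0.0, 0.5))
same = signature_value(*[affine_map(p, weak, (3.0, -2.0)) for p in pts], e)

# Uniform scaling by 2 (determinant 4): areas grow fourfold, so the value is
# recovered only if e is scaled by 4 as well.
double = ((2.0, 0.0), (0.0, 2.0))
rescaled = signature_value(*[affine_map(p, double, (0.0, 0.0)) for p in pts], 4.0 * e)
```

The last case shows why a fixed e gives weak affine rather than full affine invariance: under a change of scale the same three points are reached only with a correspondingly scaled e.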

4 Signature Matching 

In this section we discuss the metric used to compare signatures. We also show 
that it is possible to use the proposed metric in order to overcome the otherwise 
complicated problem of invariance to affine and projective transformations. 



4.1 Weak Affine Transformations 

Since the object of our study is curves that have already been sampled by 
arbitrary methods, two polygonal approximations of the same curve are, strictly 
speaking, different curves, and we cannot expect their invariant signatures to be 
identical. However, if the scale parameter e is chosen appropriately³, the 
signature functions of two polygonal instances of the same curve should be similar 
in both the arclength and value dimensions (respectively the x and y dimensions in 
the graphs of Figure 3). 

Value perturbations are trivially dealt with by standard metrics (e.g. $L_2$). 
However, to deal with arclength perturbations we have to resort to more 
complicated metrics. Figure 3 is an example of the combined value/arclength 
deformation problem. It depicts different polygonal approximations of the same smooth 
curve, and the corresponding weak-affine invariant signatures. Notice that 
approximation c is slightly too sparse for the chosen e; nevertheless, the proposed 
metric will handle this case well. 

We propose to employ a composite measure based on standard metrics (e.g. 
$L_2$) both in the value and in the arclength dimensions. The proposed metric is 
the well known warp metric⁴ used in many fields of engineering. It is based on 
the following minimization problem: 

Given two signature functions $V_Q(i)$, $i \in \{1, 2, \ldots, L_Q\}$ for the query curve 
and $V_T(i)$, $i \in \{1, 2, \ldots, L_T\}$ for the template, we look for the optimal warp 
or reparameterization function $\Psi : \{1, 2, \ldots, L_Q\} \to \{1, 2, \ldots, L_T\}$ to minimize 
the composite distance function 

$$D(V_Q, V_T) = \min_{\Psi} \sum_{i=1}^{L_Q} \left\{ \left(V_Q(i) - V_T(\Psi(i))\right)^2 + \lambda \cdot \left(\nabla\Psi(i)\right)^2 \right\}$$ 

subject to constraints on $\Psi$ (we use: $\Psi(1) = 1$, and $0 \le \nabla\Psi(i) \le 2$). In the above 
$\lambda$ is a scalar constant, and $\nabla\Psi(i) = \Psi(i) - \Psi(i-1)$. 

³ See discussion at the end of Section 2. 
⁴ Known also as Dynamic Warping, reparameterization, and Viterbi algorithm. 



Fig. 3. Polygonal approximations of a smooth curve (a. 276, b. 125, and c. 55 vertices) 
and corresponding weak-affine invariant signatures. 

The optimization problem described above is solved in $O(L_Q \cdot L_T)$ time by 
the Dynamic Programming algorithm. Dynamic programming is based mainly 
on the following recursion: 

Given a series of solutions to the warping problems of a given part of the 
query signature $V_Q(i)$, $i \in \{1, 2, \ldots, k\}$, to all the sub-parts of the template 
signature $V_T(i)$, $i \in \{1, 2, \ldots, j\}$, with $j \in \{1, 2, \ldots, L_T\}$, we can solve the series 
of warping problems from a longer part of the query $V_Q(i)$, $i \in \{1, 2, \ldots, k+1\}$, 
to each of the sub-parts of $V_T(\cdot)$. For each of the problems, simply select one of 
the optimal paths of the given set of solutions, and extend it by a single further 
match to minimize the total warp error composed of: (1) the total error at the 
given path, (2) the warp error due to the required step, and (3) the additional match. 
Figure 4 depicts the recursive completion process. 
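
The recursion can be sketched as a short dynamic program. This is a minimal illustration under assumptions rather than the authors' code: the match cost and the step penalty are both taken to be squared terms, the warp step is restricted to 0, 1, or 2, the first query sample is tied to the first template sample at zero initial cost, and the end point is left free. The names `warp_distance`, `lam`, and `max_step` are chosen for this example only.

```python
import math
from typing import List, Sequence


def warp_distance(
    vq: Sequence[float],
    vt: Sequence[float],
    lam: float = 1.0,
    max_step: int = 2,
) -> float:
    """Warp distance from query signature vq to template signature vt."""
    lt = len(vt)
    # prev[j]: lowest cost of matching the query prefix so far with the last
    # query sample mapped to template sample j.
    prev: List[float] = [math.inf] * lt
    prev[0] = 0.0  # boundary condition: first sample matches first sample
    for i in range(1, len(vq)):
        cur: List[float] = [math.inf] * lt
        for j in range(lt):
            best = math.inf
            for s in range(max_step + 1):  # admissible warp steps 0..max_step
                if j - s >= 0 and prev[j - s] < math.inf:
                    # (1) error of the path so far + (2) error of the step
                    best = min(best, prev[j - s] + lam * s * s)
            if best < math.inf:
                # (3) error of the additional match
                cur[j] = best + (vq[i] - vt[j]) ** 2
        prev = cur
    # Relaxed end condition: best match of the full query to any template prefix.
    return min(prev)


d_same = warp_distance([0, 1, 2, 3], [0, 1, 2, 3], lam=0.0)
d_stretched = warp_distance([0, 0, 1, 1, 2, 2], [0, 1, 2], lam=0.0)
d_penalised = warp_distance([0, 1, 2, 3], [0, 1, 2, 3], lam=1.0)
```

Only the previous row of the table is kept, so memory is proportional to the template length while the running time remains proportional to the product of the two signature lengths.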




Fig. 4. Recursive path extension in Dynamic Programming. 



The recursive process is initiated according to the boundary conditions. In 
our implementation we assume $\Psi(1) = 1$, and thus $\{V_Q(1)\}$ matches $\{V_T(1)\}$, 
and none of the other sub-parts of $V_T$. Thus, $D(\{V_Q(1)\}, \{V_T(1)\}) = 0$, and 
$D(\{V_Q(1)\}, \{V_T(j)\}) = \infty$, $\forall j \neq 1$. In real applications one cannot take initial 
matching for granted. A more realistic initial condition for signatures on is 
described in EH 

The recursion is terminated similarly, according to the boundary conditions. 
If we insisted that $\Psi(L_Q) = L_T$, we would have chosen the resulting match of 
the full signature of $V_Q$ to the full signature of $V_T$. However, we chose to apply 
a more relaxed boundary condition, and to select the best match of 
the full signature $V_Q$ to either of the longer sub-parts of $V_T$. 



4.2 Affine Transformations 

The weak affine signature is invariant to affine transformations only up to the 
scale parameter e. Note that the signature's arclength and value are derived 
from e via area ratios, which are in turn invariant to affine transformations⁶. 
Thus, representing a plane curve by a set of signatures representing a range 
of e values, instead of a single signature corresponding to a specific $e_0$, makes 
it possible to identify the planar curve under affine transformations. A similar 
signature scale space has been employed elsewhere for a different purpose. 

The practical question is, naturally, how many signatures to keep. The range 
of e should be determined by the range of scales relevant to the application in 
mind. Specifically, if we expect the affine scale of queries to be in the range of 
×0.5 to ×2 relative to the template, we will need signatures with scale parameters 
in the range 0.25 × $e_0$ to 4 × $e_0$. As for the number of e values in the range, they 
have to be selected so that the signatures of intermediate scales will be similar 
to one of the represented signatures. 
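
Because e is an area, it scales with the square of the linear affine scale, which is how the range 0.25 × $e_0$ to 4 × $e_0$ follows from scales of ×0.5 to ×2. The sketch below generates a geometric ladder of e values over such a range. The spacing ratio of 1.2 is an assumption for illustration only (it is the ratio used in Figure 5); the text does not prescribe a spacing.

```python
from typing import List


def scale_space_e_values(
    e0: float, min_scale: float, max_scale: float, ratio: float = 1.2
) -> List[float]:
    """e values covering linear scales in [min_scale, max_scale] around e0."""
    lo = e0 * min_scale ** 2  # area scales with the square of linear scale
    hi = e0 * max_scale ** 2
    values: List[float] = []
    e = lo
    while e < hi:
        values.append(e)
        e *= ratio
    values.append(hi)  # always include the upper end of the range
    return values


ladder = scale_space_e_values(e0=1.0, min_scale=0.5, max_scale=2.0)
```

A smaller ratio gives more signatures per template and therefore a closer match for intermediate scales, at the cost of more storage and matching time.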

Although we have not proven the following conjecture, we found empirically 
that signatures change slowly with e. Moreover, the fact that we use a 
warp metric reduces the influence of arclength deformations, leaving us mainly 
with the relatively gradual value changes. Figure 5 depicts the way signatures 
change with e. 



Fig. 5. Weak-affine invariant signatures for different e values (a. $e_0$, b. 1.2 × $e_0$). 



⁶ If we knew that the affine scale parameter between the query and the template is e.g. 
β, we could use β² · e instead of e, and thus be 'invariant' to the affine transformation. 



4.3 Projective Transformations 

In this section we show how a modified warp distance can accommodate object 
recognition under projective transformations. Let us first define the affine and 
projective transformations in the plane. An affine transform of a point $X = 
(x, y)^T$ is $X_a = AX + V$, where V is a translation vector, and A a non-singular 
2×2 matrix parameterizing rotation, scale, skew, and aspect ratio. A projective 
transform is Xp = + V, where W parameterizes tilt. 

Given a small neighborhood in the plane a projective transformation can 

be approximated by an affine transformation with a scale parameter . 

Thus a projective transform can be approximated by an affine transformation 
with space varying affine scale parameter. For our purposes we have e{X,Y) = 

. Note that for continuous curves e(X,Y) changes continuously on 

the curve. 

Before we detail the proposed projective invariant matching, let us recall the 
affine matching described in Subsection 4.2, where we proposed the following 
procedure: 

Calculate the warp distance from the signature $V_Q$ of the query curve to a 
set of signatures $V_T^m$ corresponding to the template curve and a set of e values. 
The query is considered an affine transformation of the template if at least one of 
the warp distances is below a predetermined threshold. Evidently, it is sufficient 
to consider the smallest warp distance over all m. Thus, the proposed algorithm 
is equivalent to an algorithm combining the dynamic programming algorithms 
of the different warps into a single algorithm matching $\{V_Q\}$ to a huge set of 
states $\{V_T^m\}$. Note that: 

1. The initialization should facilitate equal conditions for matching $\{V_Q(1)\}$ to 
the initial states $\{V_T^m(1)\}$ of each of the template's signatures. 

2. The recursion should restrict path extension to within the same m. 

3. The warp representing the template is the path corresponding to the best 
final state. 

Now consider the case of projective invariant matching. The only difference 
to the above algorithm is the need to enable the recursion step to extend paths 
across similar scales. 

Figure 6 shows the signature (b) of a projective transformation of a curve 
and the signature corresponding to the best warp into the template's signature 
scale space. Namely, values of the template signature were compiled by 
tracking the best path selected by the algorithm. They were taken from weak-affine 
signatures corresponding to different scales. The signature of the template has 
been slightly lowered, otherwise it would have been difficult to distinguish the two 
signatures. This match quality would not have been possible had we limited the 
algorithm to any single scale. 



Fig. 6. A weak-affine signature of the projectively transformed curve, and the warped 
template signature (lowered to allow distinction). 



5 Summary 

In this paper we have presented a weak-affine invariant re-sampling method for 
polygonal approximations of smooth curves. The weak-affine signature of the 
resulting polygon is invariant to the original curve sampling method. 

We proposed to use a signature scale space similar to one described previously, 
and argued that a metric based on a modified Dynamic Programming algorithm 
accommodates projective invariant object recognition. 



Abstract. We introduce a rule-based approach for the learning and 
recognition of complex movement sequences in terms of spatio-temporal 
attributes of primitive event sequences. During learning, spatio-temporal 
decision trees are generated that satisfy relational constraints of the 
training data. The resulting rules are used to classify new movement 
sequences, and general heuristic rules are used to combine classification 
evidences of different movement fragments. We show that this approach 
can successfully learn how people construct objects, and can be used to 
classify and diagnose unseen movement sequences. 



1 Introduction 



Over the past years, we have explored new methods for the automatic 
learning of spatio-temporal patterns. These methods combine advantages of 
numerical learning methods with those of relational learners, 
and lead to a class of learners which induce over numerical attributes but are 
constrained by relational pattern models. Our approach, Conditional Rule 
Generation (CRG), generates rules that take the form of numerical decision trees 
that are linked together so that relational constraints of the data are satisfied. 
Relational pattern information is introduced adaptively into the rules, i.e. it is 
added only to the extent that is required for disambiguating classification rules. 

In contrast to Conditional Rule Generation, traditional numerical learning 
methods are not relational, and induce rules over unstructured sets of numerical 
attributes. They thus have to assume that the correspondence between candidate 
and model features is known before rule generation (learning) or rule evaluation 
(matching) occurs. This assumption is inappropriate when complex models have 
to be learned, as is the case when complex movements of multiple limb segments 
have to be learned. Many symbolic relational learners (e.g. Inductive Logic 
Programming), on the other hand, are not designed to deal efficiently with numerical 
data. Although they induce over relational structures, they typically generalize 
or specialize only over symbolic variables. It is thus rare that the symbolic 
representations explicitly constrain the permissible numerical generalizations. It is 
these disadvantages of numerical learning methods and inductive logic 
programming that CRG is trying to overcome. 

Since CRG induces over a relational structure it requires general model 
assumptions, the most important being that the models are defined by a labeled 
graph where relational attributes are defined only with respect to 
neighbouring vertices. Such assumptions constrain the types of unary and binary features 
which can be used to resolve uncertainties (Figure 1). 



Fig. 1. Example of input data and conditional cluster tree generated by CRG method. 
The left panel shows input data and the attributed relational structures generated for 
these data, where each vertex is described by a unary feature vector u and each edge 
by a binary feature vector b. We assume that there are two pattern classes, class 1 
consisting of the drinking glass and the mug, and class 2 consisting of the teapot. The 
right panel shows a cluster tree generated for the data on the left. Numbers refer to the 
vertices in the relational structures, rectangles indicate generated clusters, grey ones 
are unique, white ones contain elements of multiple classes. Classification rules of the 
form $U_i - B_{ij} - U_j \ldots$ are derived directly from this tree. 


Recently, we have successively extended the CRG method for learning spatial 
patterns (for learning of objects and recognizing complex scenes) to the learning 
of spatio-temporal patterns. This method, CRGst, was successfully applied 
to the learning and recognition of very brief movement sequences that lasted 
up to 1-2 seconds. In this paper, we describe how CRGst can be applied to 
the recognition of very long and complex movement sequences that last over 
much longer time periods. Specifically, we test the suitability of CRGst for 
recognizing how people assemble fairly complex objects over time periods of up to 
half a minute. 

In the following, we introduce the spatial CRG method and the 
spatio-temporal CRGst method. We discuss representational issues, rule generation, 
and rule application. We then show first results on the application of CRGst to 
the recognition of complex construction tasks. 

2 Spatial Conditional Rule Generation 

In Conditional Rule Generation, classification rules for patterns or pattern 
fragments are generated that include structural pattern information to the extent 
that is required for correctly classifying a set of training patterns. CRG analyzes 
unary and binary features of connected pattern components and creates a tree of 
hierarchically organized rules for classifying new patterns. Generation of a rule 
tree proceeds in the following manner (see Figure 1). 

First, the unary features of all parts of all patterns are collected into a unary 
feature space U in which each point represents a single pattern part. The feature 
space U is partitioned into a number of clusters $U_i$. Some of these clusters may 
be unique with respect to class membership (e.g. cluster $U_1$) and provide a 
classification rule: If a pattern contains a part $p_r$ whose unary features $u(p_r)$ satisfy 
the bounds of a unique cluster $U_i$ then the pattern can be assigned a unique 
classification. The non-unique clusters contain parts from multiple pattern classes 
and have to be analyzed further. For every part of a non-unique cluster we collect 
the binary features of this part with all adjacent parts in the pattern to form 
a (conditional) binary feature space $UB_i$. The binary feature space is clustered 
into a number of clusters $UB_{ij}$. Again, some clusters may be unique (e.g. 
clusters $UB_{22}$ and $UB_{31}$) and provide a classification rule: If a pattern contains a 
part $p_r$ whose unary features satisfy the bounds of cluster $U_i$, and there is 
another part $p_s$, such that the binary features $b(p_r, p_s)$ of the pair $(p_r, p_s)$ satisfy 
the bounds of a unique cluster $UB_{ij}$ then the pattern can be assigned a unique 
classification. For non-unique clusters, the unary features of the second part $p_s$ 
are used to construct another unary feature space $UBU_{ij}$ that is again clustered 
to produce clusters $UBU_{ijk}$. This expansion of the cluster tree continues until 
all classification rules are resolved or a maximum rule length has been reached. 

If unresolved clusters remain at the end of the expansion procedure (which is
normally the case), the clusters and their associated classification rules are
split into more discriminating rules using an entropy-based splitting procedure.
The elements of an unresolved cluster (e.g. cluster $UBU_{212}$ in Figure 1) are
split along a feature dimension such that the normalized partition entropy

$$H_P(T) = \frac{n_1 H(P_1) + n_2 H(P_2)}{n_1 + n_2} \tag{1}$$

is minimized, where $H$ is entropy and $n_1$, $n_2$ are the numbers of elements in
the two partitions $P_1$ and $P_2$. Rule splitting continues until all classification
rules are unique or some termination criterion has been reached. This results in
a tree of conditional feature spaces (Figure 1), and within each feature space,
rules for cluster membership are developed in the form of a decision tree.
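
As a rough illustration of the splitting criterion in (1), the following Python sketch scans candidate thresholds along a single feature dimension and keeps the one with the lowest normalized partition entropy. The function names, the use of Shannon entropy in bits, and the choice of midpoints between sorted values as candidate thresholds are assumptions made for this example; the paper does not specify them.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def partition_entropy(left, right):
    """Normalized partition entropy (n1*H(P1) + n2*H(P2)) / (n1 + n2)."""
    n1, n2 = len(left), len(right)
    return (n1 * entropy(left) + n2 * entropy(right)) / (n1 + n2)


def best_split(values, labels):
    """Return (threshold, H_P) minimizing the partition entropy along one
    feature dimension; threshold is None if no split is possible."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = partition_entropy(left, right)
        if h < best[1]:
            best = (threshold, h)
    return best
```

Applied recursively to each resulting partition, such a split yields the decision tree mentioned above; a real implementation would repeat the scan over every feature dimension and keep the best one.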

From the empirical class frequencies of all training patterns one can derive
an expected classification (or evidence vector) $E$ associated with each rule (e.g.
$E(UBU_{212}) = [0.5, 0.5]$, given that it contains one element of each class).
Similarly, one can compute evidence vectors for partial rule instantiations, again
from empirical class frequencies of non-terminal clusters (e.g.
$E(UB_{21}) = [0.75, 0.25]$). Hence an evidence vector $E$ is available for every
partial or complete rule instantiation.
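
A minimal sketch of how such an evidence vector could be computed from the class labels of the training parts that fall into a cluster. The helper name `evidence_vector` and the list-based representation are hypothetical and only serve the illustration.

```python
from collections import Counter


def evidence_vector(labels, classes):
    """Empirical class frequencies of the training parts in one cluster,
    returned in the order given by `classes`."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]
```

For a cluster holding one element of each of two classes this gives `[0.5, 0.5]`, and for three elements of the first class and one of the second it gives `[0.75, 0.25]`, matching the examples in the text.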

3 Spatio-Temporal Conditional Rule Generation: CRGst

We now turn to CRGst , a generalization of CRG from a purely spatial domain 
into a spatio-temporal domain. Here, data consist typically of time-indexed pat- 
tern descriptions, where pattern parts are described by unary features, spatial 
part relations by (spatial) binary features, and changes of pattern parts by
(temporal) binary features. In contrast to more popular temporal learners such as
hidden Markov models and recurrent neural networks, the rules generated from
CRGst are not limited to first-order time differences but can utilize more distant 
(lagged) temporal relations depending on data model and uncertainty resolution 
strategies. At the same time, CRGst can generate non-stationary rules, unlike 
e.g. multivariate time series which also accommodate correlations beyond first- 
order time differences but do not allow for the use of different rules at different 
time periods. 

We now discuss the modifications that are required for CRG to deal with 
spatiotemporal patterns, first with respect to pattern representation and then 
with respect to pattern learning. This should give the reader a good idea of the 
representation and operation of CRGst.

Representation of Spatio-Temporal Patterns 

A spatio-temporal pattern is defined by a set of labeled, time-indexed, attributed
features. A pattern $P_i$ is thus defined as
$P_i = \{p_{i1}(a : t_{i1}), \ldots, p_{in}(a : t_{in})\}$, where $p_{ij}(a : t_{ij})$
corresponds to part $j$ of pattern $i$ with attributes $a$ that are true at time
$t_{ij}$. The attributes $a = \{u, b_s, b_t\}$ are defined with respect to
specific labeled features, and are either unary (single feature attributes) or binary
(relational feature attributes), either over space or over space-time (see Figure 2).
Examples of unary attributes $u$ include area, brightness, position; spatial binary
attributes $b_s$ include distance, relative size; and temporal binary attributes $b_t$
include changes in unary attributes over time, such as size, orientation change,
or long range position change.

Our data model, and consequently our rules, are subject to spatial and tem- 
poral adjacency (in the nearest neighbour sense) and temporal monotonicity, i.e. 
features are only connected in space and time if they are spatially or temporally 
adjacent, and the temporal indices for time must be monotonically increasing 
(in the “predictive” model) or decreasing (in the “causal” model) . Although this 
limits the expressive power of our representation, it is still more general than 
strict first-order discrete time dynamical models such as hidden Markov models 
or Kalman filters. 

For CRGst, finding an “interpretation” involves determining sets of linked
lists of attributed and labeled features that are causally indexed (i.e. the
temporal indices must be monotonic) and that maximally index a given pattern.

Fig. 2. A spatio-temporal pattern consisting of three parts over three time-points.
Undirected arcs indicate spatial binary connections, solid directed arcs indicate
temporal binary connections between the same part at different time-points, and
dashed directed arcs indicate temporal binary connections between different parts
at different time-points.



Rule Learning 

CRGst generates classification rules for spatio-temporal patterns involving a 
small number of pattern parts subject to the following constraints: First, the 
pattern fragments involve only pattern parts that are adjacent in space and 
time. Second, the pattern fragments involve only non-cyclic chains of parts. 
Third, temporal links are followed in the forward direction only to produce 
causal classification rules that can be used in classification and in prediction 
mode. 

Rule learning proceeds in the following way: First, the unary features of all
parts (of all patterns at all time points), $u(p_{it})$, $i = 1, \ldots, n$,
$t = 1, \ldots, T$, are collected into a unary feature space $U$ in which each point
represents the feature vector of one part at one time point. From this point
onward, cluster tree generation proceeds exactly as described in Section 2, except
that expansion into a binary space can now follow either spatial binary relations
$b_s$ or temporal binary relations $b_t$. Furthermore, temporal binary relations $b_t$
can be followed only in strictly forward direction, analyzing recursively temporal
changes of either the same part, $b_t(p_{i,t}, p_{i,t+1})$ (solid arrows in Figure 2),
or of different pattern parts, $b_t(p_{i,t}, p_{j,t+1})$ (dashed arrows in Figure 2), at
subsequent time-points $t$ and $t + 1$. Again, the decision about whether to follow
spatial or temporal relations is determined by entropy-based criteria.

4 Rule Application 

A set of classification rules is applied to a spatio-temporal pattern in the following
way. Starting from each pattern part (at any time point), all possible sequences
(chains) of parts are generated using parallel, iterative deepening, subject to
the constraints that only adjacent parts are involved and no loops are generated.
(Note that the same spatio-temporal adjacency constraints and temporal
monotonicity constraints were used for rule generation.) Each chain is classified
using the classification rules. Expansion of a chain
$S_i = \langle p_{i1}, p_{i2}, \ldots, p_{in} \rangle$ terminates if one of the following
conditions occurs: 1) the chain cannot be expanded without creating a cycle, 2) all
rules instantiated by $S_i$ are completely resolved (i.e. have entropy 0), or 3) the
binary features $b_s(p_{ij}, p_{i,j+1})$ or $b_t(p_{ij}, p_{i,j+1})$ do not satisfy the
feature bounds of any rule.

If a chain $S$ cannot be expanded, the evidence vectors of all rules instantiated
by $S$ are averaged to obtain the evidence vector $E(S)$ of the chain $S$. Further,
the set $S_p$ of all chains that start at $p$ is used to obtain an initial evidence vector
for part $p$:

$$E(p) = \frac{1}{|S_p|} \sum_{S \in S_p} E(S) \tag{2}$$

where $|S|$ denotes the cardinality of the set $S$. Evidence combination based on
(2) is adequate, but can be improved by noting that nearby parts (both in
space and time) are likely to have the same classification. To the extent that
this assumption of spatio-temporal coherence is justified, the part classification
based on (2) can be improved.

We use general heuristics for implementing spatio-temporal coherence among
pattern parts. One such rule is based on the following idea. For a chain
$S_i = \langle p_{i1}, p_{i2}, \ldots, p_{in} \rangle$, the evidence vectors
$E(p_{i1}), E(p_{i2}), \ldots, E(p_{in})$ are likely to be similar, and dissimilarity
of the evidence vectors suggests that $S_i$ may contain fragments of different
movement types. This similarity can be captured in the following way: for a chain
$S_i = \langle p_{i1}, p_{i2}, \ldots, p_{in} \rangle$,

$$w(S_i) = \frac{1}{n} \sum_{k=1}^{n} E(p_{ik}) \tag{3}$$

where $E(p_{ik})$ refers to the evidence vector of part $p_{ik}$. Initially, this can be
found by averaging the evidence vectors of the chains which begin with part
$p_{ik}$. Later, the compatibility measure is used for updating the part evidence
vectors in an iterative relaxation scheme



$$E^{(t+1)}(p) = \Phi\left( \frac{1}{Z} \sum_{S \in S_p} w^{(t)}(S) \otimes E(S) \right) \tag{4}$$

where $\Phi$ is the logistic function $\Phi(z) = (1 + \exp[-20(z - 0.5)])^{-1}$, $Z$ is a
normalizing factor, and the binary operator $\otimes$ is defined as a component-wise
vector multiplication, $[a\ b]^T \otimes [c\ d]^T = [ac\ bd]^T$. Convergence of the
relaxation scheme (4) is typically obtained in about 10-20 iterations.
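
The relaxation scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: chains are tuples of part indices whose first entry is the starting part, `chain_evidence` holds one evidence vector $E(S)$ per chain, the normalizing factor $Z$ is taken to be the sum of the components of the combined vector, and parts without any chain keep a uniform evidence vector. None of these choices is prescribed by the text.

```python
import numpy as np


def logistic(z):
    """Logistic squashing function with slope 20 centred at 0.5."""
    return 1.0 / (1.0 + np.exp(-20.0 * (z - 0.5)))


def relax(chains, chain_evidence, n_parts, n_iter=20):
    """Iteratively update part evidence vectors from chain evidence.

    chains         -- list of tuples of part indices; chains[i][0] is the start part
    chain_evidence -- array-like of shape (n_chains, n_classes) holding E(S)
    """
    chain_evidence = np.asarray(chain_evidence, dtype=float)
    n_classes = chain_evidence.shape[1]
    # Indices of the chains that start at each part (the sets S_p).
    starts = [[i for i, c in enumerate(chains) if c[0] == p] for p in range(n_parts)]

    # Initial part evidence: average evidence of the chains starting at the part.
    E = np.full((n_parts, n_classes), 1.0 / n_classes)
    for p in range(n_parts):
        if starts[p]:
            E[p] = chain_evidence[starts[p]].mean(axis=0)

    for _ in range(n_iter):
        # Chain compatibility: mean evidence of the parts along the chain.
        w = np.array([E[list(c)].mean(axis=0) for c in chains])
        new_E = E.copy()
        for p in range(n_parts):
            if not starts[p]:
                continue
            combined = (w[starts[p]] * chain_evidence[starts[p]]).sum(axis=0)
            total = combined.sum()  # assumed form of the normalizing factor Z
            if total > 0:
                new_E[p] = logistic(combined / total)
        E = new_E
    return E
```

With two parts and chains whose evidence favours the first class, for instance `relax([(0, 1), (1, 0), (0,)], [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]], n_parts=2)`, both parts end up strongly assigned to that class.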
5 Learning to Recognize Complex Construction Tasks



Learning and recognition were tested with an example where a person constructed
three different objects (a “sink”, a “spider” and a “U-turn”) using pipes and
connectors (see Figure 3). Each construction took about 20-30 s to finish, and was
repeated five times. Arm and hand movements were recorded using a Polhemus
system with four sensors located on the forearm and hand of both arms.
The sensors were recording at 120 Hz, and the system was calibrated to an
accuracy of ±5 mm. From the position data $(x(t), y(t), z(t))$ of the sensors, 3D
velocity $v(t)$, acceleration $a(t)$, and curvature $k(t)$ were extracted, all w.r.t. arc
length $ds(t) = (dx^2(t) + dy^2(t) + dz^2(t))^{1/2}$. Sample time-plots of these
measurements are shown in Figure 4. These measurements were smoothed with
a Gaussian filter with $\sigma = 0.25$ s (see Figure 4) and then sampled at intervals of
0.25 s.
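
The feature extraction could look roughly like the following NumPy sketch. It is not the authors' code: finite differences via `np.gradient`, speed as $ds/dt$, acceleration as the derivative of speed, the standard space-curve curvature $|r' \times r''| / |r'|^3$, and a kernel truncated at four standard deviations are all assumptions chosen for the illustration.

```python
import numpy as np


def gaussian_smooth(x, sigma_samples):
    """Smooth a 1-D signal with a Gaussian kernel (sigma given in samples)."""
    radius = int(4 * sigma_samples)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma_samples) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(x, radius, mode="edge")  # repeat end values at the borders
    return np.convolve(padded, kernel, mode="valid")


def kinematics(pos, rate_hz):
    """Speed, tangential acceleration and curvature from (T, 3) position samples."""
    dt = 1.0 / rate_hz
    d1 = np.gradient(pos, dt, axis=0)      # velocity vector
    d2 = np.gradient(d1, dt, axis=0)       # acceleration vector
    v = np.linalg.norm(d1, axis=1)         # speed ds/dt
    a = np.gradient(v, dt)                 # rate of change of speed
    k = np.linalg.norm(np.cross(d1, d2), axis=1) / np.maximum(v, 1e-12) ** 3
    return v, a, k
```

At 120 Hz, $\sigma = 0.25$ s corresponds to `sigma_samples = 30`, and sampling every 0.25 s corresponds to keeping every 30th sample, e.g. `gaussian_smooth(v, 30.0)[::30]`.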




Fig. 3. Pictures of the construction tasks learned by CRGst . The leftmost image 
shows the starting position. The Polhemus movement sensors can clearly be seen on 
the left and right forearms and hands. The other three images show one stage of the 
three construction tasks used, the “sink” construction, the “spider” construction, and 
the “U-turn” construction. Each construction took about 20-30 seconds to finish. 



The spatio-temporal patterns were defined in the following way: At every
time point $t$, the patterns consisted of four parts, one for each sensor, each part
being described by unary attributes $u = [v, a, k]$. Binary attributes were defined
by simple differences, i.e. the spatial attributes were defined as
$b_s(p_{i,t}, p_{j,t}) = u(p_{j,t}) - u(p_{i,t})$, and the temporal attributes were
defined as $b_t(p_{i,t}, p_{j,t+1}) = u(p_{j,t+1}) - u(p_{i,t})$.

Performance of CRGst was tested with a leave-one-out paradigm, i.e. in
each test run, the movements of all construction tasks were learned using all but
one sample, and the resulting rule system was used to classify the remaining
instance. Learning and classification proceeded exactly as described in Sections 3
and 4, with rule length restricted to five levels (i.e. the most complex rules were
of the form UBUBU). For the parameter values reported above, 73.3% of the
tasks were correctly recognized on average.

Fig. 4. Time plots for $v(t)$ of the right hand, for sink construction, spider
construction and U-turn construction from left to right. The time range is 22 seconds
and velocity has been normalized. The top row shows the raw input data, the bottom
row the same data after filtering with a Gaussian filter with $\sigma = 0.25$ s. Each
plot shows five replications.

An example of a classification rule generated by CRGst is the following rule
that has the form $U - B_t - U - B_t - U$, where V = velocity, A = acceleration,
ΔA = acceleration difference over time, and ΔK = curvature difference over
time:

    if   U_i(t)            -1.65 < V < 2.43
    and  B_ij(t, t+1)      -0.57 < [?] < 1.14
    and  U_j(t+1)          -1.65 < [?] < 0.78
    and  B_jk(t+1, t+2)    -2.49 < ΔK < 0.73 and 1.56 < ΔA < 2.92
    and  U_k(t+2)           1.7 < V < 2.4
    then this is part of a “spider” construction

CRGst makes minimal assumptions about the data. First, given that it clas- 
sifies data within small temporal windows only, it can classify partial data (e.g. 
a short subsequence of a construction task). Second, it can easily deal with 
spatio-temporal mixtures of patterns. In other words, it can equally well classify 
sequences of different construction tasks (e.g. a person starting one task and 
continuing with another, or a person starting one task and then doing some- 
thing completely different) or even two persons doing different constructions 
at the same time. Obviously, one could incorporate stronger constraints into 
CRGst (e.g. incorporating the assumption that only a single construction task 
is present) and thus improve classification performance further. This is, however, 
not our intent, as we plan to use CRGst to detect and diagnose tasks that are 
only partially correct, i.e. where some parts of the construction task are done 
incorrectly or differently.

6 Conclusions

Most current learners are based upon rules defined iteratively in terms of ex- 
pected states and/or observations at time t+1 given those at time t. Examples 
include hidden Markov models and recurrent neural networks. Although these 
methods are capable of encoding the variations which occur in signals over time 
and can indirectly index past events of varying lags, they do not have the explicit 
expressiveness of CRGst for relational time-varying structures. 

In the current paper, we have extended our previous work on the recogni- 
tion of brief movements to the recognition of long and very complex movement 
sequences, as they occur in construction tasks. We have shown that CRGst can 
successfully deal with such data. There remains, however, much to be done. One 
obvious extension is to extend CRGst to the analysis of multiple, concurrent 
time scales that would allow a hierarchical analysis of such movement sequences. 
A second extension will involve an explicit representation of temporal relations 
between movement subsequences, and a third extension involves introducing 
knowledge-based model constraints into the analysis. 

In all, there remains much to be done in the area of spatio-temporal learning, 
and the exploration of spatio-temporal data structures which are best suited to 
the encoding and efficient recognition of complex spatio-temporal events.

Abstract. This article proposes to use both possibility theory and
rough histograms to estimate the movement between two
images in a video sequence. A fuzzy modeling of data and a reasoning
based on imprecise statistics allow us to partly cope with the constraints
associated with classical movement estimation methods such as correlation-based
or optical flow-based methods. The theoretical aspects of our method are
explained in detail, and its properties are shown. An illustrative
example is also presented.



1 Introduction 

In a static scene, the movement of a camera induces an apparent motion in the
video sequence it acquires. Phenomena such as occlusions, moving objects or
variations of the global illumination can induce parasitic motion. Classical
methods for apparent motion estimation aim at finding the main apparent
motion. They can be divided into two approaches: correlation-based and
optical flow-based. The validity of these approaches relies on strong
hypotheses which are frequently violated, thus limiting their reliability.

Matching methods require a model of the motion to be estimated. They also
require a discretization of the search area, which limits the precision of the
estimation to the sampling interval of the motion parameter space. Besides, a
correspondence between the two images is assumed to exist. Hence, the method will
return an estimate even for two totally different images. The user then has to
choose a threshold below which the estimate is considered irrelevant. The
estimation is less robust because this threshold is chosen arbitrarily rather
than set by the data.

Methods based on optical flow computation are among the most studied
for main motion estimation. The optical flow links the spatio-temporal
gradients of the irradiance image using the constraint equation:

$$E_x u + E_y v + E_t = \xi \tag{1}$$

where $E_x$, $E_y$ are the spatial gradients and $E_t$ the temporal gradient of the
luminance; $(u, v)$ is the projection of the 3D velocity field in the focal plane; $\xi$
is the variation of the global illumination.

The computation of optical flow is based on the continuity of the image irradiance.
Thus, only small movements can be estimated. Moreover, this approach is based
on two conflicting assumptions. On the one hand, the image must be sufficiently
textured so that the motion is visible; on the other hand, the computation
of the local gradients $E_x$, $E_y$ and $E_t$ is made through a low-pass filter, which
requires a weakly textured image.

The effects of these constraints can be reduced. For example, Bouthemy
proposes to use robust statistics to deal with the contamination of the main
motion due to the parasitic motion of moving objects. He also proposes a
multiresolution process to cope with the small displacement constraint.

This article presents a new method, based on possibility theory and rough
histograms, to estimate the main motion (rigid transformation between two
images of a video sequence). It is structured as follows: Section 2 briefly
reviews some fuzzy concepts. Section 3 presents rough histograms and
explains their extension to 2D. Section 4 describes the motion estimation method
in detail. Section 5 explains how a multiresolution process can improve the
method. Section 6 presents some results. Finally, Section 7 gives a conclusion
and proposes an extension of this work.

2 Update on Fuzzy Concepts 

Some of the tenets of fuzzy subset theory are now reviewed to set the stage for 
the discussion that follows. Further details on fuzzy subsets are given in 0. 

2.1 Imprecise Quantity Representation 

Fuzziness is understood as the uncertainty associated with the definition of
ill-defined data or values. A fuzzy subset of a set $\Omega$ is a mapping (or membership
function) from $\Omega$ to $[0, 1]$. When $\Omega = \mathbb{R}$ (i.e. data are real numbers), a
fuzzy subset is called a fuzzy quantity. Crisp concepts such as intervals can easily
be generalized to fuzzy quantities. A fuzzy interval is a convex fuzzy quantity, that
is, one whose membership function $\mu_Q$ is quasiconcave:

$$\forall u, v,\ \forall w \in [u, v]:\quad \mu_Q(w) \ge \min(\mu_Q(u), \mu_Q(v)) \tag{2}$$

A fuzzy interval is fully characterized by its core and support (Fig. 1(a)). A fuzzy
interval with a compact support and a unique core value is called a fuzzy number
(Fig. 1(b)).

2.2 Restricted Possibility and Necessity 

To compare two imprecise data ($H$ and $D$), one solution is to use two
measures: the possibility of $D$ with respect to $H$, $\Pi(H; D)$, and the necessity of
$D$ with respect to $H$, $N(H; D)$, defined by:

$$\Pi(H; D) = \sup_{\omega} \min(\mu_H(\omega), \mu_D(\omega)) \tag{3}$$

$$N(H; D) = \inf_{\omega} \max(\mu_H(\omega), 1 - \mu_D(\omega)) \tag{4}$$

The measure $\Pi(H; D)$ estimates to what extent it is possible for $H$ and $D$ to
refer to the same value $\omega$. $N(H; D)$ estimates to what extent it is certain that
the value to which $D$ refers is among the ones compatible with $H$.

Fig. 1. Fuzzy quantities: (a) Fuzzy interval (b) Fuzzy number.
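
To make definitions (3) and (4) concrete, here is a small numerical sketch in Python. The membership functions are sampled on a regular grid and the supremum and infimum are approximated by the maximum and minimum over the samples; the triangular shape and the helper names are assumptions made for the example.

```python
import numpy as np


def triangular(omega, left, mode, right):
    """Membership values of a triangular fuzzy number sampled on the grid omega."""
    up = (omega - left) / (mode - left)
    down = (right - omega) / (right - mode)
    return np.clip(np.minimum(up, down), 0.0, 1.0)


def possibility(mu_h, mu_d):
    """Grid approximation of sup_w min(mu_H(w), mu_D(w))."""
    return float(np.max(np.minimum(mu_h, mu_d)))


def necessity(mu_h, mu_d):
    """Grid approximation of inf_w max(mu_H(w), 1 - mu_D(w))."""
    return float(np.min(np.maximum(mu_h, 1.0 - mu_d)))
```

For example, with `H = triangular(omega, 2, 5, 8)` and a narrower `D = triangular(omega, 4, 5, 6)` sharing the same mode, the possibility is 1 while the necessity is 0.75: $D$ is entirely possible given $H$, but not entirely certain.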

3 Rough Histograms 

3.1 1-D Rough Histograms

Rough histograms are a generalization of crisp histograms. Hence they can
deal with imprecise data and tend to reduce the influence of the arbitrary choice 
of space partitioning. Using a rough histogram amounts to replacing the classical 
partition on which it is built by a fuzzy one, and using an imprecise accumula- 
tor. This accumulator is defined by its two boundaries called lower and upper 
accumulator. 

Let $(D_i)_{i = 1 \ldots N}$ be $N$ imprecise data whose probability density has
to be estimated on the interval $[e_{min}, e_{max}]$. Rough histograms give an imprecise
estimate of this probability density on the considered interval. This density
is approximated by two accumulators built on a fuzzy partition of $[e_{min}, e_{max}]$
into $p$ cells $H_k$ (Fig. 2).

Each cell $H_k$ is associated with an upper accumulator $\overline{Acc}_k$ and a lower
accumulator $\underline{Acc}_k$ defined by:

$$\overline{Acc}_k = \sum_{i=1}^{N} \Pi(H_k; D_i) \tag{5}$$

$$\underline{Acc}_k = \sum_{i=1}^{N} N(H_k; D_i) \tag{6}$$

where $\Pi(H_k; D_i)$ is the possibility of $H_k$ with respect to $D_i$ and $N(H_k; D_i)$ is
the necessity of $H_k$ with respect to $D_i$ (Section 2.2).
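
A 1-D rough histogram can be sketched as follows for the special case where each imprecise datum is a crisp interval $[a, b]$ and the fuzzy partition consists of equally spaced triangular cells whose spread equals the distance between modes. Both restrictions are assumptions made to keep the example short. For such data, the possibility of a cell is its largest membership value over $[a, b]$ and the necessity is its smallest one.

```python
import numpy as np


def tri(x, mode, spread):
    """Triangular membership function with the given mode and spread."""
    return np.maximum(0.0, 1.0 - np.abs(x - mode) / spread)


def rough_histogram(intervals, e_min, e_max, p):
    """Lower and upper accumulators for interval data on a fuzzy partition
    of [e_min, e_max] into p triangular cells."""
    modes = np.linspace(e_min, e_max, p)
    spread = modes[1] - modes[0]
    upper = np.zeros(p)
    lower = np.zeros(p)
    for a, b in intervals:
        # Possibility: membership at the point of [a, b] closest to each mode.
        closest = np.clip(modes, a, b)
        upper += tri(closest, modes, spread)
        # Necessity: a triangular membership is smallest at an interval end.
        lower += np.minimum(tri(a, modes, spread), tri(b, modes, spread))
    return modes, lower, upper
```

A single datum `(0.9, 1.1)` on cells with modes 0, 1, 2, 3, 4 gives an upper accumulator of `[0.1, 1.0, 0.1, 0, 0]` and a lower accumulator of `[0, 0.9, 0, 0, 0]`: the gap between the two reflects the imprecision of the datum.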
3.2 Rough Histograms in 2D

If the data whose density has to be estimated are two-dimensional, the 1-D
partition has to be replaced by a 2-D partition of
$[e_{min}, e_{max}] \times [f_{min}, f_{max}]$. If the t-norm min is used, the membership
function is pyramidal (Fig. 2).

Formulas (5) and (6) become:

$$\overline{Acc}_{k,l} = \sum_{i=1}^{N} \sum_{j=1}^{M} \Pi(H_{(k,l)}; D_{(i,j)}) \tag{7}$$

$$\underline{Acc}_{k,l} = \sum_{i=1}^{N} \sum_{j=1}^{M} N(H_{(k,l)}; D_{(i,j)}) \tag{8}$$

$$\Pi(H_{(k,l)}; D_{(i,j)}) = \min(\Pi(H_k; D_i),\ \Pi(H_l; D_j)) \tag{9}$$

$$N(H_{(k,l)}; D_{(i,j)}) = \min(N(H_k; D_i),\ N(H_l; D_j)) \tag{10}$$

where $D_{(i,j)} = D_i \times D_j$, $(D_i, D_j)$ are intervals and $\times$ is the Cartesian
product obtained with the t-norm min.



4 Presentation of the Method 

The method presented is akin to both correlation-based and optical flow-based
methods. It requires a discretization of the parameter space. However, rough
histograms are used to reduce the effects of this discretization. Moreover, the
vagueness induced on the density estimation by the data imprecision can easily be
taken into account by the imprecise accumulation process. Furthermore, the
constraint linking spatial and temporal changes in the image irradiance has been
relaxed: variations of gray-level pixel values are observed through two dual
classes. This enhances the robustness of the estimation with respect to the
assumptions on the motion.

The process can be decomposed into three modules. The first one gives
an estimate of the eventuality of a pixel’s spatial change based on the temporal
change of irradiance. This estimate is used by a second module to build a rough
histogram in the parameter space. Then, the last module gives an estimate
of the main apparent motion.

Fig. 3. Fuzzy membership function to pixels’ class.



4.1 Observations Processing 

This part explains how spatial change is related to irradiance change. As a matter
of fact, optical flow-based methods estimate the projection of the motion
on the gradient of the image irradiance, while correlation-based methods use a
statistical distance between the gray-level pixel values. We propose to estimate
the contingency of the displacement $(u, v)$ of a pixel $(i, j)$ by checking whether
the pixels $(i, j)$ of the first picture and $(i + u, j + v)$ of the second one belong to
the same class. For this purpose, the gray-level space is divided into two fuzzy dual
classes: black pixels and white pixels (Fig. 3). The eventuality of displacement
(or contingency) is measured in an imprecise manner with two opposing
values: the displacement possibility and the non-displacement possibility.

The displacement $(u, v)$ of the pixel $(i, j)$ can be planned if $(i, j)$ and
$(i', j') = (i + u, j + v)$ belong to the same class. Likewise, the displacement cannot
be planned if $(i, j)$ and $(i', j')$ belong to different classes. This can be written as:

$$E_\Pi(u, v) = \max(\min(\mu_{N1}, \mu_{N2}),\ \min(\mu_{B1}, \mu_{B2})) \tag{11}$$

$$E_N^c(u, v) = \max(\min(\mu_{N1}, \mu_{B2}),\ \min(\mu_{B1}, \mu_{N2})) \tag{12}$$

where $\mu_{N1}$ (resp. $\mu_{B1}$) is the membership degree of pixel $(i, j)$ in the black
pixel class (resp. white pixel class) and $\mu_{N2}$ (resp. $\mu_{B2}$) is the membership
degree of pixel $(i', j')$ in the black pixel class (resp. white pixel class). Rather
than using $E_N^c$, we prefer using its complement, defined by:

$$E_N = 1 - E_N^c \tag{13}$$

This process provides an upper and a lower measure of the confidence in the
displacement $(u, v)$. This method is akin to optical flow-based ones in trying
to link spatial and temporal variations of irradiance together. However, no
assumption on the image’s texture is made.
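
The two contingency measures of (11) to (13) can be sketched for a single pair of gray levels as follows. The linear membership functions (white membership proportional to the gray level, black membership its complement) are an assumption made for this example; the actual shapes are those of Fig. 3.

```python
def memberships(gray, max_gray=255.0):
    """Return (black, white) membership degrees of a gray level,
    assuming linear dual classes."""
    white = min(max(gray / max_gray, 0.0), 1.0)
    return 1.0 - white, white


def displacement_contingency(gray1, gray2):
    """Upper and lower measures (E_Pi, E_N) that a pixel with gray level
    gray1 in the first image moved to a pixel with gray level gray2."""
    n1, b1 = memberships(gray1)
    n2, b2 = memberships(gray2)
    e_pi = max(min(n1, n2), min(b1, b2))   # same class: displacement possible
    e_nc = max(min(n1, b2), min(b1, n2))   # different classes: non-displacement
    return e_pi, 1.0 - e_nc
```

Two black pixels give `(1.0, 1.0)`, while a black and a white pixel give `(0.0, 0.0)`. Note that with strictly dual classes at full resolution the two measures coincide; they can differ once the memberships of a pixel unit are no longer complementary.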

4.2 Matching between Observation and the Motion Model 

The main apparent motion in image space due to the motion of a camera in a
3-D environment can generally be modeled with 6 parameters (2 translations, 3
rotations, and a scaling factor). However, it has been shown that an estimation
with a 4-parameter model is more robust. These parameters are: 2 translations
along the axes of the image, 1 rotation around the normal to the focal plane and a
scaling factor representing the variations of distance between the scene and the
camera. To simplify the presentation of the method, this paper focuses on a
2-parameter movement $(T_x, T_y)$. In a further paper, a method with a 4-parameter
movement will be presented.

Fig. 4. Compatibility planned displacement / motion model.

A rough histogram built upon the parameters’ space is used to estimate the 
main mode of the motion. For this purpose, the planned displacement of each 
pixel computed at the previous stage is used to build the imprecise accumulator. 
Then, the chief mode of the density gives an estimation of the main motion. 

As mentioned above, a discretization of the parametric space $(T_x, T_y)$
is needed. We then create a 2-dimensional fuzzy partition built upon the
displacement parameter space. Let $H_{k,l}$ be the cell $(k, l)$ of the fuzzy partition
($k = 1 \ldots K$ and $l = 1 \ldots L$) whose core is reduced to its mode
$(T_{x_k}, T_{y_l})$ and whose support is defined by:

$$|H_{k,l}|_0 = [T_{x_k} - \Delta T_x,\ T_{x_k} + \Delta T_x] \times [T_{y_l} - \Delta T_y,\ T_{y_l} + \Delta T_y] \tag{14}$$

where $\Delta T_x$, $\Delta T_y$ are the spreads of the cell $H_{k,l}$. The spatial position of a
pixel $(i, j)$ is considered as an imprecise quantity. It is usually modeled
by a fuzzy number whose mode is located at the center of the pixel and whose
spread is linked to the pixel’s width.

As $u$ and $v$ are defined by the subtraction of imprecise quantities
($u = i' - i$, $v = j' - j$), they are also imprecise quantities $(U, V)$ characterized by
their supports $[u_{min}, u_{max}]$, $[v_{min}, v_{max}]$ and by their modes. The
displacement $(u, v)$ is then modeled by a pyramidal fuzzy number $D = U \times V$. It is
now necessary to find a relation between a planned movement $(u, v)$ of a pixel
$(i, j)$ and the motion model $(T_x, T_y)$. The compatibility between these two
quantities is estimated in a bi-modal way by $\Pi(H_{k,l}; D)$ and $N(H_{k,l}; D)$, which
are the possibility and the necessity of the cell $H_{k,l}$, representing the motion
$(T_x, T_y)$, with respect to the planned movement $(u, v)$.

We now assume that the contingency of the motion $(u, v)$ for a pixel $(i, j)$
has been evaluated using (11) and (13). Then the pixel $(i, j)$ will vote for the cell
$H_{k,l}$ under the assumption of a motion $(u, v)$ if this displacement can be planned
and if such a motion is compatible with the overall motion $(T_x, T_y)$.

Like the displacement contingency evaluation, two votes are obtained for the
cell $H_{k,l}$, which are accumulated in the upper and lower accumulators. These
votes are:

$$\text{Most favorable vote} = \min(E_\Pi(u, v),\ \Pi(H_{k,l}; D)) \tag{15}$$

$$\text{Less unfavorable vote} = \min(E_N(u, v),\ N(H_{k,l}; D)) \tag{16}$$

This procedure is repeated for each pixel of the image, for each planned
displacement $(u, v)$ and for each cell $H_{k,l}$ of the fuzzy partition. The two
accumulators, divided by the total number of votes, set an upper and a lower bound
on the motion probability density. One of the major advantages of this bi-modal
representation is that it provides not only an estimate of the motion but also a
confidence in this estimate, which depends on the gap between the upper and lower
accumulators.
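
The voting procedure can be sketched as below for the 2-parameter model. Several simplifications are assumed and are not taken from the paper: linear black/white memberships, the same set of modes and the same spread on both axes, displacements modeled as pyramidal fuzzy numbers of spread one pixel, and closed-form possibility and necessity between two symmetric triangular fuzzy numbers. The first array index is treated as the direction of $u$ and $T_x$.

```python
import numpy as np


def contingency(g1, g2, max_gray=255.0):
    """(E_Pi, E_N) for two gray levels, assuming linear dual memberships."""
    b1, b2 = g1 / max_gray, g2 / max_gray      # white memberships
    n1, n2 = 1.0 - b1, 1.0 - b2                # black memberships
    e_pi = max(min(n1, n2), min(b1, b2))
    e_nc = max(min(n1, b2), min(b1, n2))
    return e_pi, 1.0 - e_nc


def tri_possibility(m_h, s_h, m_d, s_d):
    """Possibility between two symmetric triangular fuzzy numbers."""
    return max(0.0, 1.0 - abs(m_h - m_d) / (s_h + s_d))


def tri_necessity(m_h, s_h, m_d, s_d):
    """Necessity of cell H (mode m_h, spread s_h) with respect to datum D."""
    return max(0.0, (s_h - abs(m_h - m_d)) / (s_h + s_d))


def accumulate_votes(img1, img2, modes, spread_t, displacements, spread_d=1.0):
    """Return (lower, upper) accumulators normalized by the number of votes."""
    k = len(modes)
    upper = np.zeros((k, k))
    lower = np.zeros((k, k))
    rows, cols = img1.shape
    n_votes = 0
    for i in range(rows):
        for j in range(cols):
            for u, v in displacements:
                i2, j2 = i + u, j + v
                if not (0 <= i2 < rows and 0 <= j2 < cols):
                    continue  # displaced pixel falls outside the second image
                e_pi, e_n = contingency(float(img1[i, j]), float(img2[i2, j2]))
                n_votes += 1
                for a, tx in enumerate(modes):
                    for b, ty in enumerate(modes):
                        # 2-D measures combine the two axes with the t-norm min.
                        pos = min(tri_possibility(tx, spread_t, u, spread_d),
                                  tri_possibility(ty, spread_t, v, spread_d))
                        nec = min(tri_necessity(tx, spread_t, u, spread_d),
                                  tri_necessity(ty, spread_t, v, spread_d))
                        upper[a, b] += min(e_pi, pos)   # most favorable vote
                        lower[a, b] += min(e_n, nec)    # less unfavorable vote
    return lower / n_votes, upper / n_votes
```

The triple loop makes the cost grow with the image size, the number of planned displacements and the number of cells, which is the complexity issue addressed by the multiresolution process of Section 5.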

4.3 Motion Estimation 

Finding the main mode in a classical histogram consists in searching for the bin
whose accumulator is maximal. The precision of this localization is related to
the spread of the histogram’s bins. Using rough histograms comes down to almost
the same thing. However, the mode can be localized more precisely than the
partition step alone would allow, because the search for the maximum
involves a kind of interpolation between the discrete values of the partition.

That is, searching for a mode in a histogram amounts to searching for the
position of a crisp or fuzzy interval $\Phi$ of given precision whose quantity of votes
is maximal, locally or globally. The number of votes pertaining to this interval
$\Phi$, given the distribution of votes on the fuzzy partition, then has to be
estimated. This estimate is obtained by transferring the imprecise number of votes
to the interval $\Phi$. This transfer uses the pignistic probability to provide
a reasonable interval $[\underline{Nb}(\Phi), \overline{Nb}(\Phi)]$ for the sought-after
number of votes. This transfer can be written as:

$$\overline{Nb}(\Phi) = \sum_{k=1}^{K} \sum_{l=1}^{L} BetP(\Phi / H_{k,l})\, \overline{Nb}(H_{k,l}) \tag{17}$$

$$\underline{Nb}(\Phi) = \sum_{k=1}^{K} \sum_{l=1}^{L} BetP(\Phi / H_{k,l})\, \underline{Nb}(H_{k,l}) \tag{18}$$

with $\overline{Nb}(H_{k,l})$ (resp. $\underline{Nb}(H_{k,l})$) the value of the histogram’s
upper accumulator (resp. lower accumulator) associated with the cell $(k, l)$.

The pignistic probability is defined by:

$$BetP(\Phi / H_{k,l}) = \frac{|\Phi \cap H_{k,l}|}{|H_{k,l}|} \tag{19}$$

where $|A|$ denotes the cardinality of $A$.

Using the pignistic probability amounts to transferring some of the votes in
favor of $H_{k,l}$ towards $\Phi$ in proportion to the overlap of $H_{k,l}$ and $\Phi$.
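
In one dimension, the vote transfer can be illustrated with the following Python sketch. It assumes that $\Phi$ is a crisp interval, that the cells are triangular, and that the cardinality of a fuzzy set is measured by the area under its membership function; these are simplifying assumptions for the example, and the function names are hypothetical.

```python
def triangle_area_upto(x, mode, spread):
    """Area under a triangular membership function (height 1) to the left of x."""
    t = max(-1.0, min(1.0, (x - mode) / spread))
    if t <= 0.0:
        return spread * (1.0 + t) ** 2 / 2.0
    return spread * (1.0 - (1.0 - t) ** 2 / 2.0)


def bet_p(phi_lo, phi_hi, mode, spread):
    """Fraction of the cell's area that lies inside the interval [phi_lo, phi_hi]."""
    overlap = (triangle_area_upto(phi_hi, mode, spread)
               - triangle_area_upto(phi_lo, mode, spread))
    return overlap / spread  # the full triangle has area `spread`


def transfer_votes(phi_lo, phi_hi, modes, spread, acc):
    """Number of votes transferred from the cell accumulators to the interval."""
    return sum(bet_p(phi_lo, phi_hi, m, spread) * a for m, a in zip(modes, acc))
```

With cells of modes 0, 1, 2 and spread 1, the interval $[0.5, 1.5]$ receives 75% of the votes of the central cell and 12.5% of the votes of each neighbour. Calling `transfer_votes` once with the upper accumulators and once with the lower ones gives the two bounds of the vote interval.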

The position of the maximum of $(\underline{Nb}(\Phi) + \overline{Nb}(\Phi))$ is sought to
find the main motion, corresponding to the maximum of an “average probability”.
Formulas (17) and (18) are functions of $Pos(\Phi)$, the position of the interval
$\Phi$. The position of $\Phi$ fulfilling:

$$\frac{d(\underline{Nb}(\Phi) + \overline{Nb}(\Phi))}{dPos(\Phi)} = 0 \tag{20}$$

corresponds to an extremum of the number of votes according to the imprecise
accumulator. If this extremum is a maximum, its position then corresponds to
the apparent motion between the two images.

The pignistic probability involves the computation of the volume of intersection
between a pyramid and a volumetric shape representing the membership
function of $\Phi$. Even when this membership function is chosen to be as simple as
possible (a parallelepiped), the computation of the pignistic probability is
very time consuming. Increasing the number of parameters of the motion model
would complicate or even prevent the computation of the pignistic probability.

Thus, using rough histograms transforms non-monotonic data into
monotonic ones, just as classical histograms transform non-regular data into
regular ones. If the data are monotonic within the interval
$[T_x - \Delta T_x, T_x + \Delta T_x] \times [T_y - \Delta T_y, T_y + \Delta T_y]$, then the
maximum of $(\underline{Nb}(\Phi) + \overline{Nb}(\Phi))$ can be found locally on the
projections of $[T_x - \Delta T_x, T_x + \Delta T_x] \times [T_y - \Delta T_y, T_y + \Delta T_y]$.
The problem then comes down to searching for two 1-D modes. This property reduces
the computational complexity brought about by the use of the pignistic probability.



5 Bi-Modal Multiresolution Process 

The complexity of this algorithm increases with the image size, the number
of planned displacements and the number of histogram cells. The computation
time would be prohibitive when large displacements are planned. This complexity
can be reduced with a multiresolution process.

Classical multiresolution methods consist in reducing the amount of data by 
averaging gray level values of a pixel unit. The multiresolution process we use is 
slightly different. To keep the imprecision information about the pixels’ gray level 
value, a 2x2 pixels unit gray level value will be represented by two magnitudes: 
the pixel unit possible membership of the black-pixel and white-pixel classes. 
This two values are defined as the maximum of the 4 pixels membership to the 
black-pixel (resp. white-pixel) class. 

At the finest resolution, these two bounds reduce to the same value.
The multiresolution process aims at an incremental estimation of the
apparent motion through a pyramidal, coarse-to-fine refinement.
It uses the following property: a displacement (u, v) at a given resolution N-1
amounts to a displacement (2u, 2v) at resolution N. Thus, large displacements
can be estimated without increasing the computation time, i.e., using a
reduced number of histogram bins.

Fig. 5. Bi-modal multiresolution process.
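The doubling property can be sketched as a coarse-to-fine loop. This is only an illustration under stated assumptions: `estimate` is a hypothetical callback standing in for the histogram-based estimator, and the `radius` parameter (the size of the search window at each level) is likewise an assumption.

```python
from typing import Callable, Tuple

Displacement = Tuple[int, int]


def coarse_to_fine(
    n_levels: int,
    estimate: Callable[[int, Displacement, int], Displacement],
    radius: int = 1,
) -> Displacement:
    """Refine a displacement from the coarsest level (0) to the finest one.

    `estimate(level, center, radius)` is a hypothetical callback returning the
    best displacement at `level` among candidates within `radius` of `center`.
    """
    u, v = 0, 0
    for level in range(n_levels):
        if level > 0:
            # A displacement (u, v) at resolution N-1 amounts to (2u, 2v) at resolution N.
            u, v = 2 * u, 2 * v
        u, v = estimate(level, (u, v), radius)
    return u, v
```

Because only a small window is searched at each level, the number of histogram bins stays fixed while the reachable displacement doubles with every additional level.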



6 Some Results 

The mosaic presented here (Fig. 6) has been built from a 30-image sequence
shot in front of the LIRMM. The video camera performs a counter-clockwise and
then clockwise rotation around the vertical axis. Since the camera is held at arm’s
length, the image is not steady and the sequence has many vertical variations.
Finally, since the image recording rate is rather slow, the global brightness varies
a lot between two successive images.

The mosaic is created by estimating the displacement (Tx, Ty) between two
consecutive images and by superposing them on the resulting image. Fig. 6
presents the readjusted superposition of the 30 images. We can notice that:

- The first and the last images of the sequence are perfectly superposed although
they are only linked through the 28 intermediate images; the estimation is thus
accurate and does not seem to be biased.

- The motion model (two translations) is not a good match to the real motion
(which includes rotations); this approximation does not hinder the method.
This shows its robustness as far as the model is concerned.

- The variations of brightness and the stream of people coming into the lab do
not disturb the detection of the main motion mode; this shows the robustness
of the method as far as data contamination is concerned.

7 Perspectives and Conclusion 

A new method of main motion estimation between two images has been presented
in this paper. This method partly relaxes the small-displacement assumption
and the texture-linked constraints. A discretization of the motion parameter space
is needed. However, using rough histograms involves some kind of natural
interpolation, leading to a motion estimation less sensitive to discretization. This
estimation method seems to be robust with regard to the model and to data
contamination. Moreover, the reliability of the detection of the main motion mode
makes it possible to qualify the given estimation.

Fig. 6. Mosaic picture of the LIRMM.

Nonetheless, this method has some drawbacks. Defining an arbitrary search
area is necessary and limits the motion that can be estimated. Searching
for the main mode of the votes’ distribution provides a reliable estimation of
the main apparent motion if the overlap between two consecutive images is
about 80%. Below this overlap rate, the probability of detecting the mode
is bigger than the detection’s own precision, so detection is not guaranteed.
Finally, the computing time needed for this algorithm is rather long, which can
restrict its field of application. This last aspect is being improved.

In a further paper the method will be extended to a 4-parameter motion
model. We will also use it for disparity estimation between two
stereo images. Finally, the theoretical research on imprecise probability needs to
be investigated further.



References 

1. B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 
17:185-203, 1981. 

2. J. M. Odobez and P. Bouthemy. Robust multi-resolution estimation of parametric
motion models applied to complex scenes. PI 788, IRISA, 1994.

3. D. Dubois and H. Prade. Possibility Theory: An Approach to Computerized
Processing of Uncertainty. Plenum Press, 1988.

4. D. Dubois, H. Prade, and C. Testemale. Weighted fuzzy pattern matching. Fuzzy 
sets and systems, 28:313-331, May 1988. 

5. O. Strauss, F. Comby, and M. J. Aldon. Rough histograms for robust statistics. In
International Conference on Pattern Recognition, volume 2, pages 688-691. IAPR,
September 2000.








6. P. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 
66:191-243, 1994. 

7. T. Denoeux. Reasoning with imprecise belief structures. Technical Report 97/44,
Université de Technologie de Compiègne, Heudiasyc Laboratory, 1997.




Prototyping Structural Shape Descriptions 
by Inductive Learning 



L.P. Cordella¹, P. Foggia¹, C. Sansone¹, F. Tortorella², and M. Vento¹

¹Dipartimento di Informatica e Sistemistica
Università di Napoli “Federico II”
Via Claudio, 21 I-80125 Napoli (Italy)
E-mail: {cordel, foggiapa, carlosan, vento}@unina.it

²Dip. di Automazione, Elettromagnetismo, Ing. dell’Informazione e Matematica Industriale
Università degli Studi di Cassino
via G. di Biasio, 43 I-03043 Cassino (Italy)
E-mail: tortorella@unicas.it



Abstract. In this paper, a novel algorithm for learning structured 
descriptions, ascribable to the category of symbolic techniques, is 
proposed. It faces the problem directly in the space of the graphs, by 
defining the proper inference operators, as graph generalization and 
graph specialization, and obtains general and coherent prototypes with a 
low computational cost with respect to other symbolic learning systems. 
The proposed algorithm is tested with reference to a problem of 
handwritten character recognition from a standard database. 



1. Introduction 

Graphs enriched with a set of attributes associated to nodes and edges (called 
Attributed Relational Graphs) are the most commonly used data structures for 
representing structural information, e.g. associating the nodes and the edges 
respectively to the primitives and to the relations among them. The attributes of the 
nodes and of the edges represent the properties of the primitives and of the relations.

Structural methods [1,2] imply complex procedures both in the recognition and in
the learning process. In fact, in real applications the information is affected by
distortions, and consequently the corresponding graphs turn out to be very different
from the ideal ones. The learning problem, i.e. the task of building a set of prototypes
adequately describing the objects (patterns) of each class, is complicated by the fact
that the prototypes, implicitly or explicitly, should include a model of the possible
distortions. The difficulty of defining effective algorithms for this task is so
high that the problem is still considered open, and only a few proposals, usable under
rather specific hypotheses, are now available.

A first approach to the problem relies upon the conviction that structured
information can be suitably encoded in terms of a vector, thus making possible the
adoption of statistical/neural paradigms; it is then possible to use the large variety of
well-established and effective algorithms both for learning and for classifying
patterns. The main disadvantage deriving from the use of these techniques is the
impossibility of accessing the knowledge acquired by the system. In fact, after
learning, the knowledge is implicitly encoded (e.g. within the weights of the connections
of the net) and its use, outside the classification stage, is strongly limited. Examples of
this approach are [2,3,4].

Another approach, pioneered by [5], contains methods facing the learning problem
directly in the representation space of the structured data [6,7,8]. So, if data are
represented by graphs, the learning procedure generates graphs for representing the
prototypes of the classes.

Some methods, ascribable to this approach, are based on the assumption that the
prototypical descriptions are built by interacting with an expert of the domain [9,10];
the inadequacy of human knowledge to find a set of prototypes really representative
of a given class significantly increases the risk of errors, especially in domains
containing either many data or many different classes.

More automatic methods are those facing the problem as a symbolic machine
learning problem [11,12], so formulated: “given a suitably chosen set of input data,
whose class is known, and possibly some background domain knowledge, find out a
set of optimal prototypical descriptions for each class”. A formal statement of the
problem and a more detailed discussion of related issues will be given in the next
section. Dietterich and Michalski provide an extensive review of this field,
populated by methods which mainly differ in the adopted formalism [11,13],
sometimes more general than that implied by the graphs.

The advantage that makes this approach really effective lies in the fact that the
obtained descriptions are explicit. Moreover, the property of being maximally general
makes them very compact, i.e. containing only the minimum information for covering 
all the samples of a same class and for preserving the distinction between objects of 
different classes, as required for understanding the features driving the classification 
task. Due to these properties, the user can easily acquire knowledge about the domain 
by looking at the prototypes generated by the system, which appear simple, 
understandable and effective. Consequently he can validate or improve the prototypes 
or understand what has gone wrong in case of classification errors. 

In this paper we propose a novel algorithm for learning structured descriptions, 
ascribable to the category of symbolic techniques. It faces the problem directly in the 
space of the graphs and obtains general and coherent prototypes with a low 
computational cost with respect to other symbolic learning systems. The proposed 
algorithm is tested with reference to a problem of handwritten character recognition 
from a standard database. 



2. Rationale of the Method 

The rationale of our approach is that of considering descriptions given in terms of
Attributed Relational Graphs and devising a method which, inspired by basic machine
learning methodologies, particularizes the inference operations to the case of graphs.
To this aim, we first introduce a new kind of Attributed Relational Graph, devoted to
representing prototypes of a set of ARG’s. These graphs are called Generalized
Attributed Relational Graphs (GARG’s) as they contain generalized nodes, edges, and
attributes. Then, we formulate a learning algorithm which builds such prototypes by
means of a set of operations directly defined on graphs. The algorithm preserves the
generality of the prototypes generated by classical machine learning algorithms.
Moreover, similarly to most machine learning systems [11], the prototypes
obtained by our system are consistent, i.e. each sample is covered by only one
prototype.

We assume that the objects are described in terms of Attributed Relational Graphs
(ARG). An ARG can be defined as a 6-tuple $G = (N, E, A_N, A_E, \alpha_N, \alpha_E)$,
where $N$ and $E \subset N \times N$ are respectively the sets of the nodes and the edges
of the ARG, $A_N$ and $A_E$ the sets of node and edge attributes, and finally $\alpha_N$
and $\alpha_E$ the functions which associate to each node or edge of the graph the
corresponding attributes.

We will suppose that the attributes of a node or an edge are expressed in the form
$t(p_1, \ldots, p_{k_t})$, where $t$ is a type chosen over a finite alphabet $T$ of possible
types and $(p_1, \ldots, p_{k_t})$ is a tuple of parameters, also from finite sets
$P_1^t, \ldots, P_{k_t}^t$. Both the number of parameters ($k_t$, the arity associated to
type $t$) and the sets they belong to depend on the type of the attribute; for some types
$k_t$ may be equal to zero, meaning that the corresponding attribute has no parameters.
It is worth noting that the introduction of the type makes it possible to differentiate
the description of different kinds of nodes (or edges); in this way, each parameter
associated to a node (or an edge) assumes a meaning depending on the type of the node
itself. For example, we could use the nodes to represent different parts of an object,
by associating a node type to each kind of part (see Fig. 1).
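A possible in-memory representation of such typed attributes and of an ARG is sketched below. The class names, field names and the `check_arity` helper are illustrative assumptions; only the type names and arities follow the geometric example of Fig. 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Attribute:
    """An attribute t(p_1, ..., p_kt): a type name plus a tuple of parameter values."""
    type: str
    params: Tuple[str, ...] = ()


@dataclass
class ARG:
    """Attributed Relational Graph: nodes and directed edges, each carrying one attribute."""
    nodes: Dict[int, Attribute] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], Attribute] = field(default_factory=dict)

    def add_node(self, node_id: int, attr: Attribute) -> None:
        self.nodes[node_id] = attr

    def add_edge(self, src: int, dst: int, attr: Attribute) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        self.edges[(src, dst)] = attr


# Arity k_t of each node type in the geometric example.
NODE_ARITY = {"triangle": 2, "rectangle": 2, "circle": 1}


def check_arity(attr: Attribute, arity: Dict[str, int]) -> bool:
    """True when the attribute carries exactly k_t parameters for its type t."""
    return len(attr.params) == arity[attr.type]


# A circle of medium size on top of a large, short rectangle.
g = ARG()
g.add_node(0, Attribute("circle", ("m",)))
g.add_node(1, Attribute("rectangle", ("l", "s")))
g.add_edge(0, 1, Attribute("on_top"))
```

Keeping the type separate from the parameter tuple mirrors the remark above: the meaning of each parameter is only defined relative to the type of the node or edge that carries it.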

A GARG is used for representing a prototype of a set of ARG’s; in order to allow a
prototype to match a set of possibly different ARG’s (the samples covered by the
considered prototype) we extend the attribute definition. First of all, the set of types of
node and edge attributes is extended with the special type $\phi$, carrying no parameter and
allowed to match any attribute type, ignoring the attribute parameters. For the other
attribute types, if the sample has a parameter whose value is within the set $P_i^t$, the
corresponding parameter of the prototype belongs to the set $P_i^{*t} = \wp(P_i^t)$, where
$\wp(X)$ is the power set of $X$, i.e. the set of all the subsets of $X$. Referring to the
example of the geometric objects in Fig. 1, a node of the prototype could have the
attribute rectangle({s,m},{m}), meaning a rectangle whose width is small or medium
and whose height is medium.

We say that a GARG $G^* = (N^*, E^*, A_N^*, A_E^*, \alpha_N^*, \alpha_E^*)$ covers a sample
$G$ and use the notation $G^* \models G$ (the symbol $\models$ denotes the relation from
here on called covering) iff there is a mapping $\mu : N^* \to N$ such that:

(1) $\mu$ is a monomorphism, and

(2) the attributes of the nodes and of the edges of $G^*$ are compatible with the
corresponding ones of $G$.

The compatibility relation, denoted with the symbol $\succ$, is formally introduced as:

$$\forall t,\ \phi \succ t(p_1, \ldots, p_{k_t}) \quad \text{and} \quad \forall t,\ t(p_1^*, \ldots, p_{k_t}^*) \succ t(p_1, \ldots, p_{k_t}) \Leftrightarrow p_1 \in p_1^* \wedge \ldots \wedge p_{k_t} \in p_{k_t}^* \qquad (3)$$

Condition (1) requires that each primitive and each relation in the prototype must
be present also in the sample. This allows the prototype to specify only the features
which are strictly required for discriminating among the various classes, neglecting
the irrelevant ones. Condition (2) constrains the monomorphism required by condition
(1) to be consistent with the attributes of the prototype and of the sample, by means of
the compatibility relation defined in (3): the latter simply states that the type of the
attribute of the prototype must be either equal to $\phi$ or to the type of the corresponding
attribute of the sample; in the latter case all the parameters of the attribute, which are
actually sets of values, must contain the value of the corresponding parameter of the
sample.
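The covering test can be sketched by brute force. The encoding is an assumption made for illustration: `PHI` (Python `None`) stands for the wildcard type, a sample attribute is a pair `(type, values)`, a prototype attribute is `PHI` or a pair `(type, value_sets)`, and graphs are given as dictionaries of nodes and directed edges. Enumerating all injective mappings is exponential and is not claimed to be the matching procedure used by the authors.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Optional, Tuple

PHI = None  # hypothetical encoding of the wildcard type phi

SampleAttr = Tuple[str, Tuple[str, ...]]
ProtoAttr = Optional[Tuple[str, Tuple[FrozenSet[str], ...]]]


def compatible(proto: ProtoAttr, sample: SampleAttr) -> bool:
    """Relation (3): phi matches anything; otherwise the types must agree and each
    sample value must belong to the corresponding prototype value set."""
    if proto is PHI:
        return True
    p_type, p_sets = proto
    s_type, s_vals = sample
    return (
        p_type == s_type
        and len(p_sets) == len(s_vals)
        and all(v in allowed for v, allowed in zip(s_vals, p_sets))
    )


def covers(
    proto_nodes: Dict[int, ProtoAttr],
    proto_edges: Dict[Tuple[int, int], ProtoAttr],
    sample_nodes: Dict[int, SampleAttr],
    sample_edges: Dict[Tuple[int, int], SampleAttr],
) -> bool:
    """Brute-force test of G* |= G: try every injective node mapping mu: N* -> N."""
    p_ids = list(proto_nodes)
    for image in permutations(sample_nodes, len(p_ids)):
        mu = dict(zip(p_ids, image))
        nodes_ok = all(compatible(proto_nodes[n], sample_nodes[mu[n]]) for n in p_ids)
        edges_ok = all(
            (mu[a], mu[b]) in sample_edges
            and compatible(attr, sample_edges[(mu[a], mu[b])])
            for (a, b), attr in proto_edges.items()
        )
        if nodes_ok and edges_ok:
            return True
    return False


# Prototype: "any part on top of a rectangle of any width and height".
SML = frozenset("sml")
proto_nodes = {0: PHI, 1: ("rectangle", (SML, SML))}
proto_edges = {(0, 1): ("on_top", ())}
```

With this encoding, a sample made of a circle on top of a rectangle is covered by the prototype, whereas a circle on top of a triangle is not, since no node of that sample is compatible with the rectangle node.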

Another important relation that will be introduced is specialization (denoted by the
symbol $\prec$): a prototype $G_1^*$ is said to be a specialization of $G_2^*$ iff:

$$\forall G,\quad G_1^* \models G \Rightarrow G_2^* \models G \qquad (4)$$

In other words, a prototype $G_1^*$ is a specialization of $G_2^*$ if every sample covered
by $G_1^*$ is also covered by $G_2^*$. Hence, a more specialized prototype imposes stricter
requirements on the samples to be covered. In Fig. 1d a prototype covering some
objects is reported.

[Figure 1, panels (a)-(d). Panel (b) gives the node type alphabet T = {triangle,
rectangle, circle} with arities k_triangle = 2, k_rectangle = 2, k_circle = 1, every
parameter ranging over {small, medium, large} = {s, m, l}, and the edge type alphabet
T = {on_top, left}. Panel (c) shows two ARGs whose nodes include circle(m),
rectangle(l,s), rectangle(m,s), rectangle(s,l) and rectangle(s,m), linked by on_top
edges. Panel (d) shows a GARG with an on_top edge pointing to a node
rectangle(s or m or l, s or m or l).]

Fig. 1. (a) Objects made of three parts: circles, triangles and rectangles. (b) The description
scheme introduces three types of nodes, each associated to a part. Each type contains a set of
parameters suitable to describe each part. Similarly, edges of the graph, describing topological
relations among the parts, are associated to two different types. (c) The graphs corresponding to
the objects in (a) and (d) a GARG representing the two ARG’s in (c). Informally the GARG
represents “any object made of a part on the top of a rectangle of any width and height”.



3. The Proposed Learning Algorithm 

The goal of the learning algorithm can be stated as follows: there is a (possibly
infinite) set $S^*$ of all the patterns that may occur, partitioned into $C$ different classes
$S_1^*, \ldots, S_C^*$ with $S_i^* \cap S_j^* = \emptyset$ for $i \neq j$; the algorithm is given a
finite subset $S \subset S^*$ (training set) of labeled patterns ($S = S_1 \cup \ldots \cup S_C$,
with $S_i = S \cap S_i^*$), from which it tries to find a sequence of prototype graphs
$G_1^*, G_2^*, \ldots, G_p^*$, each labeled with a class identifier, such that:

$$\forall G \in S^*\ \exists i : G_i^* \models G \quad \text{(completeness)} \qquad (5)$$

$$\forall G \in S^*\quad G_i^* \models G \Rightarrow class(G) = class(G_i^*) \quad \text{(consistency)} \qquad (6)$$

where $class(G)$ and $class(G^*)$ refer to the class associated with samples $G$ and $G^*$.

Of course, this is an ideal goal since only a finite subset of $S^*$ is available to the
algorithm; in practice the algorithm can only demonstrate that completeness and
consistency hold for the samples in $S$. On the other hand, eq. (5) dictates that, in order
to get as close as possible to the ideal case, the prototypes generated should be able to
model also samples not found in $S$, that is they must be more general than the
enumeration of the samples in the training set. However, they should not be too
general, otherwise eq. (6) will not be satisfied. Achieving the optimal
trade-off between completeness and consistency makes prototyping a really hard
problem.

A sketch of the learning algorithm is presented in Fig. 2; the algorithm starts with
an empty list L of prototypes, and tries to cover the training set by successively
adding consistent prototypes. When a new prototype is found, the samples covered by
it are eliminated and the process continues on the remaining samples of the training
set. The algorithm fails if no consistent prototype covering the remaining samples can
be found. It is worth noting that the test of consistency in the algorithm actually
checks whether the prototype is almost consistent, i.e. almost all the samples covered
by $G^*$ belong to the same class:

$$Consistent(G^*) \Leftrightarrow \max_i \frac{|S_i(G^*)|}{|S(G^*)|} \geq \theta \qquad (7)$$

where $S(G^*)$ denotes the set of all the samples of the training set covered by a
prototype $G^*$, $S_i(G^*)$ the samples of class $i$ covered by $G^*$, and $\theta$ a suitably
chosen threshold, close to 1. Also notice that the attribution of a prototype to a class is
performed after building the prototype.

FUNCTION Learn(S) // Returns the ordered list of prototypes
  L := [ ] // L is the list of prototypes, initially empty
  WHILE S ≠ ∅
    G* := FindPrototype(S)
    IF NOT Consistent(G*) THEN FAIL // The algorithm ends unsuccessfully
    // Assign the prototype to the class most represented
    class(G*) := argmax_i |S_i(G*)|
    L := Append(L, G*) // Add G* to the end of L
    S := S - S(G*) // Remove the covered samples from S
  END WHILE
  RETURN L

Fig. 2. A sketch of the learning procedure.

According to (7) the algorithm considers a prototype consistent if at least
a fraction θ (the consistency threshold) of the covered training samples belong to the
same class, avoiding a further specialization of this prototype that could be
detrimental to its generality.
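The covering loop and the relaxed consistency test can be sketched as follows. The sketch assumes that a sample is a pair `(graph, label)` and that `find_prototype` and `covers` are supplied by the caller; both are hypothetical placeholders for the procedures of the paper, and the comparison "at least theta" is an assumption about the threshold test.

```python
from collections import Counter
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[Any, str]  # (graph, class label)


def consistent(covered: Sequence[Sample], theta: float = 0.9) -> bool:
    """Relaxed consistency: the most represented class accounts for at least
    a fraction theta of the covered samples."""
    if not covered:
        return False
    counts = Counter(label for _, label in covered)
    return max(counts.values()) / len(covered) >= theta


def learn(
    samples: Sequence[Sample],
    find_prototype: Callable[[List[Sample]], Any],
    covers: Callable[[Any, Any], bool],
    theta: float = 0.9,
) -> List[Tuple[Any, str]]:
    """Greedy covering loop: returns the ordered list of (prototype, class) pairs."""
    prototypes = []
    remaining = list(samples)
    while remaining:
        proto = find_prototype(remaining)
        covered = [s for s in remaining if covers(proto, s[0])]
        if not consistent(covered, theta):
            raise RuntimeError("no consistent prototype covers the remaining samples")
        # The class is assigned after the prototype has been built.
        label = Counter(lbl for _, lbl in covered).most_common(1)[0][0]
        prototypes.append((proto, label))
        remaining = [s for s in remaining if not covers(proto, s[0])]
    return prototypes
```

Since every accepted prototype covers at least one remaining sample, the loop removes samples at each iteration and therefore terminates.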

The most important part of the algorithm is the FindPrototype procedure,
illustrated in Fig. 3. It performs the construction of a prototype, starting from a trivial
prototype with one node whose attribute is $\phi$ (i.e. a prototype which covers any
non-empty graph), and refining it by successive specializations until either it becomes
consistent or it covers no samples at all. An important step of the FindPrototype
procedure is the construction of a set Q of specializations of the tentative prototype $G^*$.
The adopted definition of the heuristic function H, guiding the search for the current
optimal prototype, will be examined later.



FUNCTION FindPrototype(S)
  // The initial tentative prototype is the trivial one, made of one node with attr. φ
  G* := TrivialPrototype // The current prototype
  WHILE |S(G*)| > 0 AND NOT Consistent(G*)
    Q := Specialize(G*)
    G* := argmax_{X ∈ Q} H(S, X) // H is the heuristic function
  END WHILE
  RETURN G*

Fig. 3. The function FindPrototype

At each step, the algorithm tries to refine the current prototype definition, in order
to make it more consistent, by replacing the tentative prototype with one of its
specializations. To accomplish this task we have defined a set of specialization
operators which, given a prototype graph $G^*$, produce a new prototype $\bar{G}^*$ such that
$\bar{G}^* \prec G^*$. The considered specialization operators are:

1. Node addition: $G^*$ is augmented with a new node $n$ whose attribute is $\phi$.

2. Edge addition: a new edge $(n_1^*, n_2^*)$ is added to the edges of $G^*$, where $n_1^*$ and
$n_2^*$ are nodes of $G^*$ and $G^*$ does not already contain an edge between them. The
edge attribute is $\phi$.

3. Attribute specialization: the attribute of a node or an edge is specialized according
to the following rule:

• If the attribute is $\phi$, then a type $t$ is chosen and the attribute is replaced with
$t(P_1^t, \ldots, P_{k_t}^t)$. This means that only the type is fixed, while the type
parameters can match any value of the corresponding type.

• Else, the attribute takes the form $t(p_1^*, \ldots, p_{k_t}^*)$, where each $p_i^*$ is a (not
necessarily proper) subset of $P_i^t$. One of the $p_i^*$ such that $|p_i^*| > 1$ is replaced
with $p_i^* - \{p_i\}$, with $p_i \in p_i^*$. In other words, one of the possible values of a
parameter is excluded from the prototype.
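The attribute specialization operator (operator 3) can be sketched as a generator of candidate attributes. The encoding is an illustrative assumption: `PHI` (Python `None`) stands for the wildcard type, a prototype attribute is a pair `(type, value_sets)`, and `domains` maps each type to its full parameter sets.

```python
from typing import Dict, FrozenSet, Iterator, Optional, Tuple

PHI = None  # hypothetical encoding of the wildcard type phi

ValueSets = Tuple[FrozenSet[str], ...]
ProtoAttr = Optional[Tuple[str, ValueSets]]


def specialize_attribute(
    attr: ProtoAttr, domains: Dict[str, ValueSets]
) -> Iterator[Tuple[str, ValueSets]]:
    """Yield every attribute obtained from `attr` by one specialization step."""
    if attr is PHI:
        # Fix only the type: every parameter keeps its full domain.
        for t, full_sets in domains.items():
            yield (t, full_sets)
        return
    t, sets = attr
    for i, allowed in enumerate(sets):
        if len(allowed) > 1:
            for value in sorted(allowed):
                # Exclude one admissible value from parameter i.
                yield (t, sets[:i] + (allowed - {value},) + sets[i + 1:])


SML = frozenset("sml")
DOMAINS = {"rectangle": (SML, SML), "circle": (SML,)}
```

A parameter set that has shrunk to a single value is never reduced further, so repeated specialization of one attribute stops after finitely many steps.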

The heuristic function H is introduced for evaluating how promising the
provisional prototype is. It is based on the estimation of the consistency and
completeness of the prototype (see eq. 5 and 6). To evaluate the consistency degree of
a provisional prototype $G^*$, we have used an entropy-based measure:

$$H_{cons}(S, G^*) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} + \sum_i \frac{|S_i(G^*)|}{|S(G^*)|} \log_2 \frac{|S_i(G^*)|}{|S(G^*)|} \qquad (8)$$

$H_{cons}$ is defined so that the larger the value of $H_{cons}(S, G^*)$, the more consistent
$G^*$ is; hence the use of $H_{cons}$ will drive the algorithm towards consistent prototypes.

The completeness of a provisional prototype is taken into account by a second term
of the heuristic function, which simply counts the number of samples covered by the
prototype:

$$H_{compl}(S, G^*) = |S(G^*)| \qquad (9)$$

This term is introduced in order to privilege general prototypes with respect to
prototypes which, albeit consistent, cover only a small number of samples.

The heuristic function used in our algorithm is the one described in [13]:

$$H(S, G^*) = H_{compl}(S, G^*) \cdot H_{cons}(S, G^*) \qquad (10)$$
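A small numerical sketch of the heuristic follows. It assumes that the consistency term is the entropy of the class distribution of the whole training set minus the entropy of the class distribution of the covered samples, which is one reading of the entropy-based measure compatible with the statement that larger values mean more consistent prototypes; the function names are illustrative.

```python
from collections import Counter
from math import log2
from typing import Iterable


def entropy(labels: Iterable[str]) -> float:
    """Shannon entropy (base 2) of the class distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


def heuristic(all_labels: Iterable[str], covered_labels: Iterable[str]) -> float:
    """H = H_compl * H_cons, under the assumed form of H_cons stated above."""
    all_labels, covered_labels = list(all_labels), list(covered_labels)
    h_cons = entropy(all_labels) - entropy(covered_labels)  # assumed reading of the consistency term
    h_compl = len(covered_labels)                           # number of covered samples
    return h_compl * h_cons
```

For a training set with two samples of class A and two of class B, a prototype covering exactly the two A samples scores 2 × 1 = 2, whereas a prototype covering all four samples scores 0 because it removes no class uncertainty.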



4. Experimental Results 

In order to test the learning ability of the proposed method, preliminary tests have
been carried out on digits belonging to the NIST database 19 [14].

Before describing them in terms of ARG’s, characters have been decomposed into
circular arcs by using the method presented in [15].

Fig. 4 illustrates the adopted description scheme in terms of ARG’s; basically, we
have two node types respectively used for representing the segments and their
junctions, while the edges represent the adjacency of a segment to a junction. Node
attributes are used to encode the size of the segments (normalized with respect to the
size of the whole character) and their orientation; the edge attributes represent the
relative position of a junction with respect to the segments it connects.

T_N = { stroke, junction }
k_stroke = 3
k_junction = 0
P_1^stroke = size = { very_short, short, medium, large, very_large } = { vs, s, m, l, vl }
P_2^stroke = shape = { straight, light_bent, bent, highly_bent, circle } = { s, lb, b, hb, c }
P_3^stroke = orientation = { n, nw, w, sw, s, se, e, ne }

T_E = { connection }
k_connection = 2
P_1^connection = x-projection = { leftmost, vertical, rightmost } = { l, v, r }
P_2^connection = y-projection = { bottom, horizontal, above } = { b, h, a }

Fig. 4. The alphabet of the types defined for the decomposition of the characters in circular
arcs.





[Figure 5, panels (a)-(c). Panel (c) shows an ARG whose stroke nodes carry attributes
such as (l,s,nw) and (vl,s,w) and whose edges carry attributes such as connection(r,a),
connection(l,b), connection(v,b), connection(v,h) and connection(l,a).]

Fig. 5. (a) The bitmap of a character; (b) its decomposition in terms of circular arcs; (c) its
representation in terms of ARG’s.



In Fig. 5 an example of such a representation for a digit of the NIST database is 
shown. 

No effort has been made to eliminate the noise from the training set:
experimentally it has been observed that the chosen structural description scheme
provides the prototyper with several isomorphic graphs belonging to different classes.

Some tests have been made by varying the composition of the training set and the
consistency threshold; here only the results obtained by imposing a minimum
consistency of 90% are shown, as higher values are harmful for the generalization
ability of the method. In particular, we present three different experiments: in the first
one the training set is made up of 50 samples per class, randomly chosen from the
database; in the second one 100 samples per class have been selected; and in the
third one 200 samples per class.

From Fig. 6, it can be noted that the training sets are affected by noise: in fact, the
recognition rates obtained on these sets range from 94% to 96%.

Fig. 6. Misclassification tables on the training set as a function of its size: (a) 500 samples, (b)
1000 samples, (c) 2000 samples. The R column indicates the percentage of rejected samples.



However, the results obtained by the learning system, without taking into account
the noise introduced by the description phase, are encouraging: the learning times for
the above described tests were 7, 12 and 22 minutes respectively, on a 600 MHz
Pentium III PC equipped with 128 MB of RAM. It is worth noting that when the
number of training set samples is doubled, the processing time grows by a factor slightly
less than 2. This fact suggests that the better representativeness of the input data
allows the generation of prototypes less affected by noise.

This hypothesis is confirmed by the total number of prototypes produced by the
system: starting from 500 samples, 83 prototypes have been generated; with 1000
samples 119 prototypes have been built and with a training set of 2000 samples the
number of generated prototypes was 203.

In other words, as the size of the training set grows, the performance of the system
becomes better both in terms of generalization ability and of learning times.



5. Conclusions 

In this paper we have presented a novel method for learning structural descriptions 
from examples, based on a formulation of the learning problem in terms of Attributed 
Relational Graphs. Our method, like learning methods based on first-order logic, 
produces general prototypes which are easy to understand and to manipulate, but it is 
based on simpler operations (graph editing and graph matching) leading to a smaller 
overall computational cost. A preliminary experiment has been conducted, which
seems to confirm our claims about the advantages of our method.

Abstract. We report on a method for achieving a significant truncation
of the training space necessary for recognizing rigid 3D objects
from perspective images. Considering objects lying on a table, the 
configuration space of continuous coordinates is three-dimensional. In 
addition the objects have a few distinct support modes. We show that 
recognition using a stationary camera can be carried out by training 
each object class and support mode in a two-dimensional configuration 
space. We have developed a transformation used during recognition for 
projecting the image information into the truncated configuration space 
of the training. The new concept gives full flexibility concerning the 
position of the camera since perspective effects are treated exactly. The 
concept has been tested using 2D object silhouettes as image property 
and central moments as image descriptors. High recognition speed and 
robust performance are obtained. 

Keywords: Computer vision for flexible grasping, recognition of 3D ob- 
jects, pose estimation of rigid object, recognition from perspective im- 
ages, robot-vision systems 



1 Introduction 

We describe here a method suitable for computer-vision-based flexible grasping
by robots. We consider situations where classes of objects with known shapes,
but unknown position and orientation, are to be classified and grasped in a
structured way. Such systems have many quality measures such as recognition speed,
accuracy of the pose estimation, low complexity of training, free choice of camera
position, generality of object shapes, ability to recognize occluded objects, and
robustness. We shall evaluate the properties of the present method in terms of
these quality parameters.

Recognition of 3D objects has been widely based on establishing correspondences
between 2D features and the corresponding features on the 3D object
[1-3]. The features have been point-like, straight lines or curved image elements.
The subsequent use of geometric invariants makes it possible to classify objects
and estimate their pose [4-8]. Another strategy is the analysis of silhouettes. When
perspective effects can be ignored, as when the objects are flat and the camera is
remote, numerous well-established methods can be employed in the search for a
match between descriptors of the recorded silhouette and those of silhouettes in a
data base [9-11]. Other methods are based on stereo vision or structured light
[1, 11-12].

In the present paper we face the following situation: 

• The rigid objects do not have visual features suitable for 2D-3D correspon- 
dence search or 2D-2D correspondence search used in stereo vision. 

• No structured light is employed. 

• The camera is not necessarily remote, so that we must take perspective 
effects into account. 

We propose here a ’brute force’ method [13] in which a large number of images 
or image descriptors are recorded or constructed during training. Classification 
and pose estimation is then based on a match search using the training data 
base. A reduction of the configuration space of the training data base is desirable 
since it gives a simpler training process and smaller extent of the data bases. The 
novelty of the present method is the recognition based on training in a truncated 
configuration space. 

The method is based on a nonlinear 2D transformation of the image to be
recognized. The transformation corresponds to a virtual displacement of the
object into an already trained position relative to the camera. As is relevant for many
applications, we consider objects lying at rest on a table or conveyer belt. In Sect.
2 we describe the 3D geometry of the system. We introduce the concept of ’virtual
displacement’ and define the truncated training space. Sect. 3 describes
the relevant 2D transformation and the match search leading to the classification
and pose estimation. We also specify our choice of descriptors and match
criterion in the recognition. The method has been implemented by constructing
a physical training setup and by developing the necessary software for training
and recognition. Typical data of the setup and the objects tested are
given in Sect. 4. We also present a few representative test results.

The work does not touch upon the 2D segmentation on which the present
method must rely. The 2D segmentation is known to be a severe bottleneck if
the scene illumination and relative positions of the objects are inappropriate.
In the test we used back-lighting and nonoccluded objects in order to avoid
such problems. Therefore we cannot evaluate the system properties in the case
of complex 2D segmentation.



2 The 3D Geometry and the Truncated Training Space 

In the present work we consider a selection of physical objects placed on a
horizontal table or a conveyor belt, see Fig. 1. The plane of the table surface is
denoted π. We consider gravity and assume object structures having a discrete
number of ways - here called support modes - in which the surface points touch
the table. This means that we exclude objects which are able to roll with a
constant height of the center-of-mass. Let i count the object classes and let j
count the support modes. With fixed j, each object’s pose has three degrees of
freedom, e.g. (x, y, ω), where (x, y) is the projection on the table plane π of its
center-of-mass (assuming uniform mass density), and ω is the rotation angle of
the object about a vertical axis through the center-of-mass.

A scene with one or more objects placed on the table is viewed by an ideal
camera with focal length f, pinhole position H at a distance h above the table,
and an optical axis making an angle α with the normal to the plane π, see Fig.
1. The origin of (x, y) is H’s projection O on π, and the y-axis is the projection
of the optical axis on π. Let (x, y, z) be the coordinates of a reference point of
the object. We introduce the angle φ defined by:



$$\cos\phi = \frac{y}{\sqrt{x^2 + y^2}}, \qquad \sin\phi = \frac{x}{\sqrt{x^2 + y^2}} \tag{1}$$



Consider a virtual displacement of the object so that its new position is given
by:

$$(x', y', \omega') = \left(0, \sqrt{x^2 + y^2}, \omega - \phi\right) \tag{2}$$



This displacement is a rotation about a vertical axis through the pinhole. Note
that the same points on the object surface are visible from the pinhole
H in the original and displaced position. The inverse transformation to be used
later is given by:

$$x = y' \sin\phi, \qquad y = y' \cos\phi, \qquad \omega = \omega' + \phi \tag{3}$$



The essential property of this transformation is that the corresponding 2D 
transformation is independent of the structure of the object. The truncation of 
the training space introduced in the present paper is based on this property. 
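
As an illustration only, the virtual displacement and its inverse can be
sketched in a few lines of Python. The function names and the numerical pose
are hypothetical, and the sketch assumes that the displaced object keeps its
distance from O, i.e. y' = sqrt(x^2 + y^2), which is what Eq. (3) implies.

```python
import math

def virtual_displacement(x, y, omega):
    """Rotate the pose about the vertical axis through the pinhole so that
    the object lands on the y-axis (x' = 0), cf. Eqs. (1)-(2)."""
    phi = math.atan2(x, y)        # cos(phi) = y/r, sin(phi) = x/r, Eq. (1)
    y_prime = math.hypot(x, y)    # ASSUMPTION: distance from O is preserved
    return (0.0, y_prime, omega - phi), phi

def inverse_displacement(y_prime, omega_prime, phi):
    """Recover the original pose from the displaced one, Eq. (3)."""
    return (y_prime * math.sin(phi),
            y_prime * math.cos(phi),
            omega_prime + phi)

# Hypothetical pose (mm, mm, radians), for illustration only.
pose = (30.0, 40.0, math.radians(75.0))
displaced, phi = virtual_displacement(*pose)
recovered = inverse_displacement(displaced[1], displaced[2], phi)
```

Applying the inverse to the displaced pose returns the original pose, which is
the step used at the end of the match search to report (x, y, ω).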




Fig. 1. Horizontal (A) and vertical (B) views of the system including the table, the 
camera, and the object before and after the virtual displacement. 



We focus on image properties condensed in binary silhouettes. Therefore, we
assume that the scene is arranged with a distinguishable background color so
that each object forms a well defined silhouette Ω(i, j, x, y, ω) on the camera
image. Thus, Ω(i, j, x, y, ω) is a list of coordinates (u, v) of set pixels in the
image. We assume throughout that (u, v) = (0, 0) is lying on the optical axis.
The task in the present project is to determine (i, j, x, y, ω) from a measurement
of an object’s silhouette Ω₀ and a subsequent comparison with the silhouettes
Ω(i, j, x = 0, y, ω) recorded or constructed in a reduced configuration space. In
the data base the variables y and ω are suitably discretized. Silhouettes for the
data base are either recorded by a camera using physical objects or constructed
from a CAD representation.



3 The 2D Transformation and the Match Search 

After the above mentioned virtual 3D displacement, an image point (u, v) of the
object will have the image coordinates (u', v') given by:

$$u'(\phi, u, v) = \frac{f\,(u \cos\phi + v \sin\phi \cos\alpha - f \sin\phi \sin\alpha)}{u \sin\phi \sin\alpha + v (1 - \cos\phi) \sin\alpha \cos\alpha + f (\cos\phi \sin^2\alpha + \cos^2\alpha)} \tag{4}$$

$$v'(\phi, u, v) = \frac{f\,(-u \sin\phi \cos\alpha + v (\cos\phi \cos^2\alpha + \sin^2\alpha) + f (1 - \cos\phi) \sin\alpha \cos\alpha)}{u \sin\phi \sin\alpha + v (1 - \cos\phi) \sin\alpha \cos\alpha + f (\cos\phi \sin^2\alpha + \cos^2\alpha)} \tag{5}$$

This result can be derived by considering - instead of an object displacement -
three camera rotations about H: a tilt of angle -α, a roll of angle φ, and a tilt
of angle α. Then the relative object-camera position is the same as if the object
were displaced according to the 3D transformation described in Sect. 2. Note
that the inverse transformation corresponds to a sign change of φ.
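
The point transformation can be written down directly from Eqs. (4-5). The
following Python sketch is an illustration only: the function name and the
focal length (in pixels) are assumptions, while α = 25° is the value listed in
Table 1. It also shows that a sign change of φ undoes the transformation.

```python
import math

def transform_point(phi, u, v, f, alpha):
    """Image coordinates (u', v') of the point (u, v) after the virtual
    displacement by the angle phi, following Eqs. (4)-(5)."""
    sp, cp = math.sin(phi), math.cos(phi)
    sa, ca = math.sin(alpha), math.cos(alpha)
    # Common denominator of Eqs. (4) and (5).
    denom = u * sp * sa + v * (1 - cp) * sa * ca + f * (cp * sa**2 + ca**2)
    u_new = f * (u * cp + v * sp * ca - f * sp * sa) / denom
    v_new = f * (-u * sp * ca + v * (cp * ca**2 + sa**2)
                 + f * (1 - cp) * sa * ca) / denom
    return u_new, v_new

f = 1000.0                    # hypothetical focal length in pixels
alpha = math.radians(25.0)    # camera tilt, value taken from Table 1
phi = math.radians(12.0)      # hypothetical displacement angle

u1, v1 = transform_point(phi, 80.0, -35.0, f, alpha)
u0, v0 = transform_point(-phi, u1, v1, f, alpha)   # inverse: phi -> -phi
```

Since the mapping is a rotation of the viewing ray about the vertical axis
through H, followed by a perspective division, the round trip reproduces the
original point up to floating point error.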

By transforming all points in a silhouette Ω according to (4-5), one obtains
a new silhouette Ω'. Let us denote this silhouette transformation T_φ, so that

$$\Omega' = T_\phi(\Omega) \tag{6}$$

The 2D center-of-mass of Ω is (u_cm(Ω), v_cm(Ω)). The center-of-mass of the
displaced silhouette Ω' is close to the transformed center-of-mass of the original
silhouette Ω. In other words

$$u_{cm}(\Omega') \approx u'(\phi, u_{cm}(\Omega), v_{cm}(\Omega)) \tag{7}$$

$$v_{cm}(\Omega') \approx v'(\phi, u_{cm}(\Omega), v_{cm}(\Omega)) \tag{8}$$

This holds only approximately because the 2D transformation in (4-5) is
nonlinear. Fig. 2 shows a situation in which Ω' is a square. The black
dot in Ω' is the transformed center-of-mass of Ω, which is displaced slightly
from the center-of-mass of Ω'.

Fig. 2. An image component Ω and the transformed version Ω' in the case that Ω' is a
square. The values of α and φ are given in the upper right corner. The black dot in Ω
is the 2D center-of-mass. After transformation this point has a position shown as the
black dot in Ω'.



Let Ω_tr = Ω_tr(i, j, y, ω) be the silhouettes of the training with x = 0. The
training data base consists of descriptors of Ω_tr(i, j, y, ω) with suitably
discretized y and ω. In case of not too complex objects,

$$u_{cm}(\Omega_{tr}) \approx 0 \tag{9}$$

The object to be recognized has the silhouette Ω₀. This silhouette defines an
angle φ₀ given by

$$\phi_0 = \arctan\left(\frac{u_{cm}(\Omega_0)}{f \sin\alpha - v_{cm}(\Omega_0) \cos\alpha}\right) \tag{10}$$



According to (4, 7, 10), the transformed silhouette Ω'₀ = T_φ₀(Ω₀) has a 2D
center-of-mass close to u = 0:

$$u_{cm}(\Omega'_0) \approx 0 \tag{11}$$

We shall return to the approximations (7-9,11) later. 
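
The role of Eq. (10) can be illustrated with a small Python sketch. It is not
the original implementation: the helper names, the focal length and the square
test silhouette are assumptions. The sketch computes φ₀ from the center-of-mass
of a silhouette and shows that the transformed center-of-mass point lands on
u = 0, while the center-of-mass of the transformed silhouette is only close to
u = 0.

```python
import math

def transform_point(phi, u, v, f, alpha):
    """Eqs. (4)-(5): image point after the virtual displacement by phi."""
    sp, cp = math.sin(phi), math.cos(phi)
    sa, ca = math.sin(alpha), math.cos(alpha)
    denom = u * sp * sa + v * (1 - cp) * sa * ca + f * (cp * sa**2 + ca**2)
    u_new = f * (u * cp + v * sp * ca - f * sp * sa) / denom
    v_new = f * (-u * sp * ca + v * (cp * ca**2 + sa**2)
                 + f * (1 - cp) * sa * ca) / denom
    return u_new, v_new

def centre_of_mass(points):
    """2D center-of-mass of a silhouette given as a list of (u, v) pixels."""
    n = len(points)
    return (sum(u for u, _ in points) / n, sum(v for _, v in points) / n)

def phi_from_silhouette(points, f, alpha):
    """Angle phi_0 defined by the silhouette center-of-mass, Eq. (10)."""
    u_cm, v_cm = centre_of_mass(points)
    return math.atan(u_cm / (f * math.sin(alpha) - v_cm * math.cos(alpha)))

def transform_silhouette(phi, points, f, alpha):
    """Silhouette transformation T_phi of Eq. (6), applied point by point."""
    return [transform_point(phi, u, v, f, alpha) for u, v in points]

# Hypothetical silhouette: a 21 x 21 pixel square centred at (120, 40).
square = [(u, v) for u in range(110, 131) for v in range(30, 51)]
f, alpha = 1000.0, math.radians(25.0)   # f is an assumed value in pixels

phi0 = phi_from_silhouette(square, f, alpha)
moved = transform_silhouette(phi0, square, f, alpha)
u_cm_point, _ = transform_point(phi0, 120.0, 40.0, f, alpha)  # on u = 0
u_cm_moved, _ = centre_of_mass(moved)                         # near u = 0
```

The small residual in `u_cm_moved` is the effect of the nonlinearity discussed
around Eqs. (7-8).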

Eqs. (9) and (11) imply that Ω'₀ is to be found among the silhouettes
Ω_tr(i, j, y, ω) of the data base. Because of the approximations (9, 11), the
similarity between Ω'₀ and Ω_tr(i, j, y, ω) is not exact with regard to
translation, so one must use translationally invariant descriptors in the
comparison.

In the search for a match between Ω_tr(i, j, y, ω) and Ω'₀ it is convenient to
use that v_cm(Ω'₀) ≈ v_cm(Ω_tr(i, j, y, ω)). It turns out that v_cm(Ω_tr(i, j, y, ω)) is
usually a monotonic function of y, so - using interpolation - one can calculate a
data base slice with a specified value v_cm and with i, j, ω as entries. This means
that i, j, and ω can be determined by a match search between moments of Ω'₀
and moments in this data base slice. Note that the data base slice involves one
continuous variable ω. With a typical step size of 3° the data base slice has only
120 records per support mode and per object class.
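
A minimal sketch of how one record of such a data base slice might be computed
is given below. It assumes linear interpolation, which the text does not
specify, and the function name, array layout and numbers are hypothetical.
For one fixed (i, j, ω), the descriptors stored at the discretized y values are
interpolated at the y for which v_cm equals the measured value.

```python
import numpy as np

def database_slice(y_grid, v_cm_of_y, descriptors_of_y, v_target):
    """Interpolate one (i, j, omega) record at a specified v_cm.

    y_grid:            (n_y,) trained y values
    v_cm_of_y:         (n_y,) v_cm of the training silhouettes, monotonic in y
    descriptors_of_y:  (n_y, n_desc) descriptors of the training silhouettes
    Returns the interpolated y and descriptor vector.
    """
    y = np.asarray(y_grid, dtype=float)
    v = np.asarray(v_cm_of_y, dtype=float)
    d = np.asarray(descriptors_of_y, dtype=float)
    if v[0] > v[-1]:
        # np.interp needs increasing sample points, so flip decreasing data.
        y, v, d = y[::-1], v[::-1], d[::-1]
    y_match = float(np.interp(v_target, v, y))
    desc = np.array([np.interp(v_target, v, d[:, k]) for k in range(d.shape[1])])
    return y_match, desc

# Hypothetical training data for one (i, j, omega): three y positions.
y_grid = [400.0, 450.0, 500.0]
v_cm = [-100.0, 0.0, 100.0]
descriptors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

y_match, desc = database_slice(y_grid, v_cm, descriptors, v_target=50.0)
```

The same interpolation gives the value of y for the matched record, which is
the relation between y and v_cm used to recover the pose.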

The results of the search are i_match, j_match, and ω_match. The value y_match can
be calculated using the relation between y and v_cm(Ω_tr) for the relevant values of
i, j, and ω. The original pose (x, y, ω) can now be found by inserting y' = y_match,
ω' = ω_match, and φ = φ₀ in Eq. (3).

If the approximation (9) breaks down, one must transform all the silhouettes
of the data base, so that the match search takes place between Ω'₀ and
Ω'_tr = T_φtr(Ω_tr), where

$$\phi_{tr} = \arctan\left(\frac{u_{cm}(\Omega_{tr})}{f \sin\alpha - v_{cm}(\Omega_{tr}) \cos\alpha}\right) \tag{12}$$

In this case φ = φ₀ - φ_tr should be inserted in (3) instead of φ₀.

We are left with the approximation (11), demonstrated in Fig. 2. This gives
a slightly wrong angle φ₀. If the corresponding errors are harmful in the pose
estimation, then one must perform an iterative calculation of φ₀, so that (11)
holds exactly.

We conclude this section by specifying our choice of 1) image descriptors
used in the data base, and 2) recognition criterion. In our test we have used as
descriptors the 8-12 lowest order central moments, namely μ₀₀, μ₂₀, μ₁₁, μ₀₂, μ₃₀,
μ₂₁, μ₁₂, and μ₀₃. The first order moments are absent since we use translationally
invariant moments. In addition we used, in some of the tests, the width and
height of the silhouette. In the recognition strategy we minimized the Euclidean
distance in a feature space of descriptors. The descriptors were normalized in
such a way that the global variance of each descriptor was equal to one [14].
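
The descriptor computation and the match criterion can be sketched as follows.
This is an illustration under stated assumptions, not the original software:
the function names are hypothetical, the eight moments are taken to be those up
to third order (including μ₁₂), and the variance normalization is computed over
the records of the data base passed to the function.

```python
import numpy as np

# Orders (p, q) of the central moments mu_pq used as descriptors.
ORDERS = [(0, 0), (2, 0), (1, 1), (0, 2), (3, 0), (2, 1), (1, 2), (0, 3)]

def central_moments(pixels):
    """Central moments of a silhouette given as an (n, 2) list of (u, v)
    set pixels. They are invariant to translation by construction."""
    p = np.asarray(pixels, dtype=float)
    du = p[:, 0] - p[:, 0].mean()
    dv = p[:, 1] - p[:, 1].mean()
    return np.array([np.sum(du**a * dv**b) for a, b in ORDERS])

def best_match(query, database):
    """Index of the data base record closest to the query descriptor vector,
    after scaling every descriptor to unit variance over the data base."""
    db = np.asarray(database, dtype=float)
    scale = db.std(axis=0)
    scale[scale == 0] = 1.0          # leave constant descriptors unscaled
    distances = np.linalg.norm((db - np.asarray(query, float)) / scale, axis=1)
    return int(np.argmin(distances))

# Hypothetical silhouette: a 4 x 2 pixel rectangle.
rectangle = [(u, v) for u in range(4) for v in range(2)]
m = central_moments(rectangle)
shifted = central_moments([(u + 100, v - 50) for u, v in rectangle])

# Hypothetical two-descriptor data base with three records.
index = best_match([9.0, 1.2], [[0.0, 0.0], [10.0, 1.0], [0.0, 2.0]])
```

Without the normalization, descriptors with a large numerical range (such as
the third order moments) would dominate the Euclidean distance.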



4 The Experiments 

Fig. 3 shows the setup for training and test. We used a rotating table for scanning
through the training parameter ω and a linear displacement of the camera for
scanning through the parameter y. The pose estimation was checked by a grasping
robot. In order to avoid 2D segmentation problems we used backlighting in
both training and recognition.

The parameters for the setup and typical objects tested are shown in
Table 1.

We report here the result of a test of a single object, a toy rabbit
manufactured by LEGO®, see Fig. 4. Its 5 support modes are shown along with the
support mode index used. After training using an angular step size of Δω = 4°,
we placed the rabbit with one particular support mode in 100 random poses, i.e.
values of (x, y, ω), in the field of view. The support modes detected by the vision
system were recorded. This test was repeated for the remaining support modes.
The results are shown in Table 2. The two confused support modes were
’support by four paws’ and ’support by one ear and fore paws’. It can be
understood from Fig. 4 that these two support modes are most likely to be mixed up
in a pose estimation. We repeated the experiment with Δω = 7.2°. In this case
no errors were measured in the 500 tests.

Fig. 3. Setup for training and test. The calibration template is used for calibrating the
camera relative to a global coordinate system. [Labels in the figure: camera; robot;
system for linear camera displacement; rotatable light table with object.]

Table 1. Properties and parameters of the objects and the test setup.

Angle α                                       25°
Height h                                      800 mm
Field of view                                 400 mm x 300 mm
Δω = angular step size during training        4°-7.2°
Δy = translational step size during training  50 mm
Camera resolution (pixels)                    768 x 576
Number of support modes of objects            3-5
Typical linear object dimensions              25-40 mm
Typical linear dimensions of object images    40-55 pixels
Silhouette descriptors                        8 lowest order centr. moments
                                              + width & height of silhouette
Number of data base records per object        900-1500 for Δω = 7.2°
Training time per support mode                5 min.
Recognition time (after 2D segmentation)      5-10 ms

Fig. 4. The toy rabbit shown in its five support modes. The support mode indices used
in Table 2 are written in the upper left corner.

Table 2. Statistics of the support mode detection when the toy rabbit was placed at
random positions (x, y, ω) in the field of view. Each support mode was tested 100 times
and the experiments involved two different angular step sizes in the training.

Angular step size       |          7.2°           |           4°
True j \ Detected j     |   0    1    2    3    4 |   0    1    2    3    4
0 standing              | 100    -    -    -    - | 100    -    -    -    -
1 lying on left side    |   -  100    -    -    - |   -  100    -    -    -
2 lying on right side   |   -    -  100    -    - |   -    -  100    -    -
3 on fore & hind paws   |   -    -    -   98    2 |   -    -    -  100    -
4 on ear & fore paws    |   -    -    -    1   99 |   -    -    -    -  100




5 Discussion 

A complete vision system for flexible grasping consists of two processes, one
performing the 2D segmentation, and one giving the 3D interpretation. We have
developed a method to be used in the second component only, since we used an
illumination and object configuration giving very simple and robust
segmentation.

The method developed is attractive with respect to the following aspects: 

• High speed of the 3D interpretation. 

• Generality concerning the object shape. 

• Flexibility of camera position and object shapes, since tall objects, closely 
positioned cameras, and oblique viewing directions are allowed. In case of 
ambiguity in the pose estimation when viewed by a single camera, it is easy 
to use 2 or more cameras with independent 3D interpretation. 

• Simple and fast training without assistance from vision experts. 

The robustness and total recognition speed depend critically on the 2D
segmentation, and so we cannot draw conclusions on these two quality parameters.
The method in its present form is not suitable for occluded objects.

One remaining property to be discussed is the accuracy of the pose
estimation. In our test the grasping uncertainty was about ±2 mm and ±3°.
However, the origin of these uncertainties was not traced, so they may be
reduced significantly by careful camera-robot co-calibration.

In our experiments we used a rather coarse discretization of y and ω, and
only one object at a time was recognized. The recognition time in the experiment
was typically 20 ms per object (plus segmentation time). This short processing
time gives plenty of room for more demanding tasks involving more objects,
more support modes, and higher accuracy through a finer discretization of y and
ω.

6 Conclusion 

We have developed and tested a computer vision concept appropriate in a brute 
force method based on data bases of image descriptors. We have shown that a 
significant reduction of the continuous degrees of freedom necessary in the train- 
ing can be achieved by applying a suitable 2D transformation during recognition 
prior to the match search. The advantages are the reductions of the time and 
the storage used in the training process. 

The prototype developed will be used for studying a number of properties 
and possible improvements. First, various types of descriptors and classification 
strategies will be tested. Here, color and gray tone information should be in- 
cluded. Second, the over-all performance with different 2D segmentation strate- 
gies will be studied, particularly those allowing occluded objects. Finally, the 
concept of training space truncation should be extended to systems recognizing 
objects of arbitrary pose.



Abstract. In the framework of an evolutionary approach to machine
learning, this paper presents the preliminary version of a learning system
that uses Genetic Programming as a tool for automatically inferring the
set of classification rules to be used by a hierarchical handwritten
character recognition system. In this context, the aim of the learning system is
that of producing a set of rules able to group character shapes, described
by using structural features, into super-classes, each corresponding to
one or more actual classes. In particular, the paper illustrates the
structure of the classification rules and the grammar used to generate them,
the genetic operators devised to manipulate the set of rules and the
fitness function used to match the current set of rules against the samples
of the training set. Finally, the experimental results obtained by using a set of
5,000 digits randomly extracted from the NIST database are
reported and discussed.



1 Introduction 

The recognition of handwritten characters involves identifying a correspondence
between the pixels of the image representing the samples to be recognized and
the abstract definitions of characters (models or prototypes). The prevalent
approach to solve the problem is that of implementing a sequential process, each
stage of which progressively reduces the information passed to subsequent ones.
This process, however, requires that decisions about the relevance of a piece of
information be taken from the early stages, with the unpleasant
consequence that a mistake may prevent the correct operation of the whole system.
For this reason, the last stage of the process is usually by far the most complex
one, in that it must collect and combine in rather complex ways many pieces of
information that, all together, should be able to recover, at least partially, the
loss of information introduced in the previous ones. Moreover, due to the extreme
variability exhibited by samples produced by a large population of writers,
pursuing such an approach often requires the use of a large number of prototypes for
each class, in order to capture the distinctive features of different writing styles.
The combination of complex classification methods with large sets of prototypes
has a dramatic effect on the classifier: a larger number of prototypes requires a
finer discriminating power, which, in turn, requires more sophisticated methods
and algorithms to perform the classification. While such complex methods are
certainly useful for dealing with difficult cases, they are useless, and even
dangerous, for simple ones; if the input sample happens to be almost noiseless, and
its shape corresponds to the most “natural” human perception of a certain
character, invoking complex classification schemes on large sets of prototypes results
in overloading the classifier without improving the accuracy.

For this reason, we have investigated the possibility of using a preclassification
technique whose main purpose is that of reducing the number of prototypes to
be matched against a given sample while dealing with simple cases. The general
problem of reducing a classifier’s computational cost has been faced since the 70’s
in the framework of statistical pattern recognition and more recently within
shape-based methods for character recognition. The large majority of the
preclassification methods for character recognition proposed in the literature
belongs to one of two categories, depending on whether they differ in the set
of features used to describe the samples or in the strategy adopted for labeling
them. Although they exhibit interesting performance, their main drawback is
that a slight improvement in the performance results in a considerable increase
of the computational costs.

In the framework of an evolutionary approach to character recognition that we are
developing, this paper reports a preliminary version of a learning system that
uses Genetic Programming as a tool for producing a set of prototypes able
to group character shapes, described by using structural features, into
super-classes, each corresponding to one or more actual classes. The proposed tool
works in two different modes. During an off-line unsupervised training phase,
rather rough descriptions of the shape of the samples belonging to the training
set are computed from the feature set, and then allowed to evolve according to
the Genetic Programming paradigm in order to achieve the maximum coverage
of the training set. Then, each prototype is matched and labeled with the classes
whose samples are matched by that prototype. At run time, after the feature
extraction, the same shape description is computed for a given sample. The
labels of the simplest prototype matching the sample represent the classes the
sample most likely belongs to.

The paper is organized as follows: Section 2 describes the character shape
description scheme and how it is reduced to a feature vector, which represents
the description the Genetic Programming works on. In Section 3 we present
our approach and its implementation, while Section 4 reports some preliminary
experimental results and a few concluding remarks.

2 From Character Shape to Feature Vector

In the framework of structural methods for pattern recognition, the most
common approach is based on the decomposition of an initial representation into
elementary parts, each of which is simply describable. In this way a character is
described by means of a structure made up of a set of parts interrelated by more
or less complex links. Such a structure is then described in terms of a sentence
of a language or of a relational graph. Accordingly, the classification, that is the
assignment of a specimen to a class, is performed by parsing the sentence, so as
to establish its accordance with a given grammar, or by some graph matching
technique. The actual procedure we have adopted for decomposing and
describing the character shape is articulated into three main steps: skeletonization,
decomposition and description. During the first step, the character skeleton is
computed by means of a MAT-based algorithm, while in the following one it is
decomposed into parts, each one corresponding to an arc of a circle, which we have
selected as our basic element. Finally, each arc found within the character
is described by the following features:

— size of the arc, referred to the size of its bounding box; 

— angle spanned by the arc, 

— direction of the arc curvature, represented by the oriented direction of the 
normal to the chord subtended by the arc. 

The spatial relations among the arcs are computed with reference to arc
projections along both the horizontal and vertical axes of the character bounding box.
In order to further reduce the variability still present among samples belonging
to the same class, the descriptions of both the arcs and the spatial relations are
given in discrete form. In particular, we have assumed the following ranges of
values:

— size, small, medium, large; 

— span: closed, medium, wide; 

— direction: N, NE, E, SE, S, SW, W, NW; 

— relation: over, below, to-the-right, superimposed, included. 

Those descriptions are then encoded into a feature vector of 139 elements. The
first 63 elements of the vector are used to count the occurrences of the different
arcs that can be found within a character, the following 13 elements describe
the set of possible relations and the remaining 63 count the number of the
different relations that originate from each type of arc.

3 Learning Explicit Classification Rules 

Evolutionary Learning Systems seem to offer an effective prototyping
methodology, as they are based on Evolutionary Algorithms (EAs), which represent a
powerful tool for finding solutions in complex high dimensional search spaces
when there is no a priori information about the distribution of the samples in the
feature space. They perform a parallel and adaptive search by generating
new populations of individuals and evaluating their fitness while interacting with
the environment. Since EAs work by directly manipulating an encoded
representation of the problem, and because such a representation can hide relevant
information, thus severely limiting the chance of a successful search, problem
representation is a key issue in EAs. In our case, as in many others, the most natural
representation for a solution is a set of prototypes or rules, whose genotype’s size
and shape are not known in advance, rather than a set of fixed-length strings.
Since classification rules may be thought of as computer programs, the most natural
way to introduce them into our learning system is that of adopting the Genetic
Programming paradigm. Such a paradigm combines Genetic Algorithms (GAs) and
programming languages in order to evolve hierarchical computer programs of
dynamically varying complexity (size and shape) according to a given defined
behavior. According to this paradigm, populations of computer programs are evolved
by using Darwin’s principle that evolution by natural selection occurs when a
population of replicating entities possesses the heritability characteristic and is
subject to genetic variation and struggle to survive.

Typically, Genetic Programming starts with an initial population of ran- 
domly generated programs composed of functionals and terminals especially tai- 
lored to deal with the problem at hand. The performance of each program in 
the population is measured by means of a fitness function, whose nature also 
depends on the problem. After the fitness of each program has been evaluated, 
a new population is generated by selection, recombination and mutation of the 
current programs, and replaces the old one. Then, the whole process is repeated 
until a termination criterion is satisfied. 

In order to implement the Genetic Programming paradigm, the following
steps have to be executed:

— definition of the structures to be evolved; 

— choice of the fitness function; 

— definition of the genetic operators. 



3.1 Structure Definition 

In order to define the individual structures that undergo adaptation in Genetic
Programming, one needs a program generator, providing syntactically correct
programs, and an interpreter for the structured computer programs, in order to
execute them.

The program generator is based on a grammar written for S-expressions. A
grammar G is a quadruple G = (T, V, S, P), where T and V are disjoint finite
alphabets. T is the terminal alphabet, V is the non-terminal alphabet, S is the
start symbol, and P is the set of production rules used to define the strings
belonging to our language, usually written as v → w, where v is a string on
(T ∪ V) containing at least one non-terminal symbol, while w is an element of
(T ∪ V)*. For the problem at hand, the set of terminals is the following:

T = {a_1, a_2, ..., a_139, 0, 1, ..., 9},

and the set V is composed as follows:

V = {∧, ∨, <, ≤, =, ≥, >, A, X, I, M, C, B},

where a_i is a variable atom denoting the i-th element in the feature vector,
and the digits 0, 1, ..., 9 are constant atoms used to represent the value of each
element in the feature vector. It should be noted that the above sets satisfy the
requirements of closure and sufficiency. The adopted set of production rules
is reported in Table 1.

Table 1. The grammar for the random rules generator.

Production Rule No.   Production Rule                    Probability
1                     S → A                              1.0
2                     A → CBC | (CBC) | (IMX)            0.25 | 0.25 | 0.5
3                     I → a_1 | ... | a_139              uniform
4                     X → 0 | 1 | ... | 9                uniform
5                     M → < | ≤ | = | ≥ | >              uniform
6                     ...                                uniform
7                     ...                                uniform

Each individual in the initial random population is generated starting with
the symbol S that, according to the above grammar, can be replaced only by the
symbol A. The latter, in turn, can be replaced by itself, by its opposite,
or by any recursive combination of logical operators whose arguments are the
occurrences of the elements in the feature vector. It is worth noticing that, in
order to avoid the generation of very long individuals, the clause IMX has a
higher probability of being selected to replace the symbol A than the
other ones that appear in the second production rule listed in Table 1.
Obviously, the set of all the possible structures is the set of all possible compositions
of functions that can be obtained recursively from the set of predefined functions.

Finally, the interpreter is a simple model of a computer and is constituted 
by an automaton that computes Boolean functions, i.e. an acceptor. Such an 
automaton computes the truth value of the rules in the population with respect 
to a set of samples. 
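
The generator and the acceptor can be illustrated with a short Python sketch.
It is not the original implementation. The probabilities 0.5 for the clause
(I M X) and 0.5 in total for the two C B C clauses follow the grammar, while
the expansions assumed for C (A or its negation) and B (and/or), the depth cap
and all names are assumptions introduced only to make the sketch runnable.

```python
import operator
import random

N_FEATURES = 139                      # a_1 ... a_139, stored 0-based here
COMPARATORS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
               ">=": operator.ge, ">": operator.gt}
MAX_DEPTH = 8                         # ASSUMPTION: hard cap, not in the paper

def generate_a(rng, depth=0):
    """Expand the symbol A: a comparison (I M X) with probability 0.5,
    otherwise a binary combination C B C (with or without parentheses)."""
    if depth >= MAX_DEPTH or rng.random() < 0.5:
        feature = rng.randrange(N_FEATURES)            # I: uniform over a_i
        comparator = rng.choice(sorted(COMPARATORS))   # M: uniform
        value = rng.randrange(10)                      # X: uniform over 0..9
        return ("cmp", feature, comparator, value)
    connective = rng.choice(["and", "or"])             # ASSUMPTION for B
    return (connective, generate_c(rng, depth + 1), generate_c(rng, depth + 1))

def generate_c(rng, depth):
    """ASSUMPTION: C is A or its negation, each with probability 0.5."""
    subtree = generate_a(rng, depth)
    return subtree if rng.random() < 0.5 else ("not", subtree)

def evaluate(rule, features):
    """Acceptor: truth value of a rule for one feature vector."""
    tag = rule[0]
    if tag == "cmp":
        _, feature, comparator, value = rule
        return COMPARATORS[comparator](features[feature], value)
    if tag == "not":
        return not evaluate(rule[1], features)
    left, right = evaluate(rule[1], features), evaluate(rule[2], features)
    return (left and right) if tag == "and" else (left or right)

rng = random.Random(0)
rule = generate_a(rng)                # S -> A
sample = [0] * N_FEATURES             # hypothetical all-zero feature vector
accepted = evaluate(rule, sample)
```

Because the clause (I M X) terminates the recursion and is chosen half of the
time, most generated rules stay short, in line with the remark on the
probabilities of the second production rule.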

3.2 Fitness Function 

The definitions reported in the previous subsection allow for the generation of
the genotypes of the individuals. The next step is the definition
of a fitness function to measure the performance of the individuals. For this
purpose, it should be noted that EAs suffer from a serious drawback when dealing
with multi-peaked (multi-modal) fitness landscapes, in that they are not able to
deal with cases where the global solution is represented by different optima, i.e.
species, rather than by a single one. In fact, they do not allow the evolution of
stable multiple species within the same population, because the fitness of each
individual is evaluated independently of the composition of the population at
any time. As a result, the task of the evolutionary process is reduced to that
of optimizing the average value of the fitness in the population. Consequently, the
solution provided by such a process consists entirely of genetic variations of the
best individual, i.e. of a stable single species distribution. This behaviour is very
appealing whenever the uniform convergence to the global optimum
exhibited by the canonical EAs is desirable. On the other hand, there exist many
applications of interest that require the discovery and the maintenance of multiple
optima, such as multi-modal optimization, inference induction, classification,
machine learning, and biological, ecological, social and economic systems.

In order to evolve several stable distributions of solutions, each
representing a given species, we need a mechanism that prevents the distribution of the
species with the highest selective value from replacing the competing species, by
inducing some kind of restorative pressure in order to balance the convergence
pressure of selection. To this end, several strategies for the competition and
cooperation among the individuals have been introduced. The natural mechanism to
handle such cooperation and competition is sharing the environmental resources, i.e.
similar individuals share such resources, thus inducing niching or speciation.



For the problem at hand, it is obvious that we deal with different species. In
fact, our aim is to perform an unsupervised learning in such a way that the result
of the evolution is the emergence of different rules (species), each of which covers
a different set of samples. Thus, the global solution will be represented by the
disjunctive-normal-form of the discovered species in the final population.
Moreover, in the case of handwritten characters the situation gets an additional twist.
In fact, it is indisputable that different writers may refer to different prototypes
when drawing samples belonging to a given class. A classical example is that of
the digit ‘7’, which is written with a horizontal stroke crossing the vertical one
in many European countries, while such a stroke is usually not used in North
American countries.

Therefore, some kind of niching must be incorporated at different levels into
the fitness function. In our case, we have adopted a niching mechanism based
on resource sharing. According to resource sharing, the cooperation and
competition among the niches is obtained as follows: for each finite resource s_i, a
subset P of prototypes from the current population is selected. They are left to
compete for the resource s_i in the training set. Only the best prototype receives
the payoff. In the case of a tie, the resource s_i is assigned randomly among all
deserving individuals. The winner is detected by matching each individual in
the sample P against the sample of the training set. In our case, a prototype
p matches a sample s if and only if p covers s, i.e. the features present in the
sample are represented also in the prototype and their occurrences satisfy the
constraints expressed in the prototype. At the end of the cycle, the fitness of
each individual is computed by adding all rewards earned. In formula:

$$\phi(p) = c \sum_{i=1}^{m} \mu(p, s_i),$$

where φ(p) is the fitness of the prototype p, m is the number of samples in
the training set, μ(p, s_i) is a function that takes into account whether the
prototype p is the winner for the sample s_i (it is 1 if p is the winner for the
sample s_i and 0 otherwise) and c is a scaling factor.
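
The fitness computation under resource sharing can be sketched as follows.
This is an illustration under stated assumptions: the function and parameter
names are hypothetical, the `covers` predicate stands in for the matching of a
prototype against a sample, and every prototype of the selected subset that
covers the sample is treated as equally deserving, so that the winner is drawn
at random among them.

```python
import random

def shared_fitness(prototypes, samples, covers, subset_size, c=1.0, rng=None):
    """Fitness of every prototype: c times the number of samples it won.

    For each sample a random subset of the population competes; the sample
    (resource) is assigned to one covering prototype, ties broken at random.
    """
    rng = rng or random.Random()
    rewards = [0] * len(prototypes)
    k = min(subset_size, len(prototypes))
    for s in samples:
        competitors = rng.sample(range(len(prototypes)), k)
        deserving = [p for p in competitors if covers(prototypes[p], s)]
        if deserving:
            rewards[rng.choice(deserving)] += 1    # only the winner is paid
    return [c * r for r in rewards]

# Hypothetical toy problem: prototypes are intervals, samples are integers.
prototypes = [(0, 4), (5, 9), (20, 30)]
samples = list(range(10))

def covers(prototype, sample):
    low, high = prototype
    return low <= sample <= high

fitness = shared_fitness(prototypes, samples, covers, subset_size=3,
                         c=0.5, rng=random.Random(1))
```

In this toy case the first two prototypes cover disjoint halves of the samples
and obtain equal fitness, while the third covers nothing and receives no
payoff, so both niches survive selection.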

3.3 Selection and Genetic Operators 

The selection mechanism is responsible for choosing among the individuals in 
the current population the ones that are replicated, without alterations, in the 
new population. To this aim many different selection mechanisms have been pro- 
posed. Nevertheless, we have to choose a mechanism that helps the maintenance 
of the discovered niches by reducing the selection pressure and noise fS]- So, 
we have used in our Genetic Programming the well known Stochastic Universal 
Selection mechanism fTTlj . according to which we have exactly the same expec- 
tation as Roulette Wheel selection, but lower variance in the expected number 
of individuals generated by the selection process. 
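
The lower variance mentioned above comes from using a single random offset and equally spaced pointers, instead of one independent spin per selected individual. A minimal sketch of this selection scheme, assuming non-negative fitness values with a positive sum (the function name and arguments are chosen for this example only):

```python
import random
from itertools import accumulate


def stochastic_universal_selection(fitness, n, rng=None):
    """Select n indices with probability proportional to fitness,
    using n equally spaced pointers and a single random offset."""
    rng = rng or random.Random()
    cumulative = list(accumulate(fitness))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("fitness values must have a positive sum")
    step = total / n
    start = rng.random() * step  # the only random draw
    selected = []
    i = 0
    for k in range(n):
        pointer = start + k * step
        # advance to the individual whose fitness segment holds the pointer
        while i < len(cumulative) - 1 and cumulative[i] < pointer:
            i += 1
        selected.append(i)
    return selected
```

With this scheme an individual whose expected number of copies is e receives either floor(e) or ceil(e) copies, whereas Roulette Wheel selection can deviate much further from e in a single generation.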

As regards the variation operators, we have used only the mutation 
operator, performing both micro- and macro-mutation. The macro-mutation 
is activated when the point to be mutated in the genotype is a node, and it 
replaces the corresponding subtree with another one randomly generated according 
to the grammar described in subsection 3.1. It follows that the 
macro-mutation modifies the structure of the decision tree 
corresponding to each prototype in much the same way as 
the classical tree-crossover operator. For this reason we have 
chosen not to implement the tree-crossover operator. 

The micro-mutation operator, in turn, is applied whenever the symbol 
selected for mutation is either a terminal or a function, and it is responsible for 
changing both the type of the features and their occurrences in the prototype. 
Therefore, it closely resembles the classical point-mutation operator. 

Finally, we have also allowed insertions, i.e. a new random node is inserted at 
a random point, along with the relative subtree if necessary, and deletions, 
i.e. a node in the tree is selected and deleted. Such mutations 
are carried out in a way that ensures the syntactic correctness of the newly generated 
programs. 

4 Experimental Results and Conclusions 

A set of experiments has been performed in order to evaluate the ability of 
Genetic Programming to generate classification rules in very complex cases, like 
the one at hand, and the efficiency of the preclassifier. The experiments were 
performed on a data set of 10,000 digits extracted from the NIST database and 
equally distributed among the 10 classes. This data set was randomly subdivided 
into a training and a test set, both including 5,000 samples. Each character 
was decomposed, described and eventually coded into a feature vector of 139 
elements, as reported in Section 2. Starting from those feature vectors, we have 
considered only 76 features, namely the 63 features describing the type of the 
arcs found in the specimen and the 13 used for coding the type of the spatial 
relations among the arcs, and used that as descriptions for the preclassifier to 
work with. 

It must be noted that the Genetic Programming paradigm is controlled by 
a set of parameters whose values affect in different ways the operation of the 
system. Therefore, before starting the experiments, suitable values should be 
assigned to the system parameters. To this purpose, we have divided this set 
of parameters into external parameters, which mainly affect the performance of 
the learning system, and internal parameters, which are responsible for the 
effectiveness and efficiency of the search. As external parameters we have assumed 
the population size N, the number of generations G and the maximum depth D 
of the trees representing the rules in the initial population. The internal 
parameters are the mutation probability $p_m$, the number n of rules competing for 
environmental resources in the resource sharing and the maximum size d of the 
subtree generated by the macro-mutation operator. The results reported in 
the sequel have been obtained with N = 1000, G = 350, D = 10, $p_m$ = 0.6, 
n = N and d = 3. Finally, let us recall that, as mentioned in subsection 
3.2, the fitness function requires to select the winner among the prototypes that 
match a given sample. We have adopted the Occam’s razor principle of simplic- 
ity closely related to Kolmogorov Complexity definition daiHI, i.e. “ if there are 
alternative explanations for a phenomenon, then, all other things being equal, 
we should select the simplest one”. In the current system, it is implemented by 
choosing as winner the prototype whose genotype is the shortest one. With such 
a criterion we expect the learning to produce the simplest prototypes covering 
as many samples as possible, in accordance with the main goal of our work: 
to design a method for reducing the computational cost of the classifier while 
dealing with simple cases by reducing the number of classes to be searched for 
by the classifier. 

As mentioned before, the first experiment was aimed at evaluating the ca- 
pability of the Genetic Programming paradigm to deal with complex cases such 
as the one at hand. For this purpose, we have monitored the covering rate C, 
i.e. the percentage of the samples belonging to the training set covered by the 
set of prototypes produced by the system during the learning. Let us recall now 
that, in our case, a prototype p matches a sample s if and only if p covers s, 
i.e. the features present in the sample are represented also in the prototype and 
their occurrences satisfy the constraints. The learning was implemented by pro- 
viding the system with the descriptions of the samples without their labels, so 
as to implement an unsupervised learning, and letting it evolve the set of prototypes 
for achieving the highest covering rate. Despite the loss of information due 
to both the discrete values allowed for the features and the reduced number of 
features considered in the experiment, the system achieved a covering rate of 
91.38%. In other words, there were 431 samples in the training set for which the 
system was unable to generate or to maintain a suitable set of prototypes within 
the time limit of 350 generations. 

Table 2. The experimental results obtained on the Test Set. 

   C    |    R      ≤3     ≤5     ≤7   |    E      ≤3     ≤5     ≤7
 87.00  |  67.16  74.18  24.72   1.10  |  19.84  81.65  17.64   0.10

In order to measure the efficiency of the classifier, we preliminarily labeled 
the prototypes obtained at the end of the learning phase. The labeling was such 
that each time a prototype matched a sample of the training set, the label of 
the sample, i.e. its class, was added to the list of labels for that prototype. At 
the end of the labeling, thus, each prototype had a list of labels of the classes 
it covered, as well as the number of samples matched in each class. A detailed 
analysis of these lists showed that there were many prototypes covering many 
samples of very few classes and very few samples of many other classes. This was 
due to “confusing” samples, i.e. samples belonging to different classes but having 
the same feature values, thus looking indistinguishable for the system. In order 
to reduce this effect, the labels corresponding to the classes whose number of 
samples covered by the prototype was smaller than 10% of the total number of 
matchings for that prototype were removed from the list. Finally, a classification 
experiment on the test set was performed, which yielded the results reported 
in Table 2 in terms of covering rate C, correct recognition rate R and error rate E. 

Such results were obtained by assuming that a sample was correctly preclas- 
sified if its label appeared within the list of labels associated to the prototype 
covering it. To emphasize the efficiency of the preclassifier, Table 2 reports the 
experimental results by grouping the prototypes depending on the number of 
classes they cover. For instance, the third column shows that 74.18% of the total 
number of samples correctly preclassified were covered by prototypes whose lists 
have at most 3 classes. 
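
The labeling and pruning procedure described above can be illustrated with a short Python sketch. It is a simplified reading of the text, not the authors' code: the input format (a dictionary from a prototype identifier to the class labels of the training samples it matched) and the function names are assumptions made for this example; only the 10% threshold comes from the paper.

```python
from collections import Counter


def label_prototypes(matches, min_share=0.10):
    """Build the list of class labels attached to each prototype.

    matches maps a prototype id to the class labels of the training
    samples it matched. A class is dropped when it accounts for less
    than min_share of that prototype's matchings.
    """
    labels = {}
    for proto, classes in matches.items():
        counts = Counter(classes)
        total = sum(counts.values())
        labels[proto] = {c for c, k in counts.items() if k >= min_share * total}
    return labels


def correctly_preclassified(sample_class, proto, labels):
    """A sample counts as correct if its class is in the label list
    of the prototype that covers it."""
    return sample_class in labels.get(proto, set())
```

The number of classes left in a prototype's list is what the classifier still has to search, so shorter lists translate directly into fewer prototypes to match.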

The experimental results reported above allow for two concluding remarks. 
The first one is that Genetic Programming represents a promising tool for learn- 
ing classification rules in very complex and structured domains, as the one we 
have considered in our experiments. In particular, it is very appealing for 
developing hierarchical handwritten character recognition systems, since it may produce 
prototypes at different levels of abstraction, depending on the way the system is 
trained. The second one is that the learning system developed by us, although it 
makes use of only a fraction of the information carried by the character shape, is 
highly efficient. Since the classification is performed by matching the unknown 
sample against the whole prototype set to determine the winner, reducing the 
number of classes reduces the number of prototypes to consider, thus resulting in 
an overall reduction of the classification time. Therefore, roughly speaking, the 
preclassifier proposed by us is able to reduce the classification time to 50% or 
less in more than 65% of the cases. 

Abstract. In this paper we propose a new classifier - the Maximal Rejection 
Classifier (MRC) - for target detection. Unlike pattern recognition, 
pattern detection problems require a separation between two 
nition, pattern detection problems require a separation between two 
classes, Target and Clutter, where the probability of the former is sub- 
stantially smaller, compared to that of the latter. The MRC is a linear 
classifier, based on successive rejection operations. Each rejection is per- 
formed using a projection followed by thresholding. The projection vector 
is designed to maximize the number of rejected Clutter inputs. It is shown 
that it also minimizes the expected number of operations until detection. 
An application of detecting frontal faces in images is demonstrated using 
the MRC with encouraging results. 



1 Introduction 

In target detection applications, the goal is to detect occurrences of a specific 
Target in a given signal. In general, the target is subjected to some particular 
type of transformation, hence we have a set of target signals to be detected. In 
this context, the set of non-Target samples is referred to as Clutter. In practice, 
the target detection problem can be characterized as designing a classifier $C(z)$, 
which, given an input vector z, has to decide whether z belongs to the Target 
class X or the Clutter class Y. In example-based classification, this classifier is 
designed using two training sets, $X = \{x_i\}_{i=1}^{L_x}$ (Target samples) and 
$Y = \{y_i\}_{i=1}^{L_y}$ (Clutter samples), drawn from the above two classes. 

Since the classifier C{z) is usually the heart of a detection algorithm, and is 
applied many times, simplifying it translates immediately to an efficient detec- 
tion algorithm. Various types of example-based classifiers are suggested in the 
literature j I . The most simple and fast are the linear classifiers, where the 
projection of z is performed onto a projection vector u, thus, $C(z) = f(u^T z)$, 
where $f(\cdot)$ is a thresholding operation (or some other decision rule). The 
Support Vector Machine (SVM) [2] and the Fisher Linear Discriminant (FLD) JQ 
are two examples of linear classifiers. In both cases the kernel u is chosen in some 
optimal manner. In the FLD, u is chosen such that the Mahalanobis distance 
of the two classes after projection will be maximized. In the SVM approach the 
motive is similar, but the vector u is chosen such that it maximizes the margin 
between the two sets. 

In both these classifiers, it is assumed that the two classes have equal impor- 
tance. However, in typical target detection applications the above assumption 
is not valid since the a-priori probability of z belonging to X is substantially 
smaller, compared to that of belonging to Y. Both the FLD and the SVM do 
not exploit this property. Moreover, in both of these methods, it is assumed that 
the classes are linearly separable. However, in a typical detection scenario the 
target class is surrounded by the clutter class, thus the classes are not linearly 
separable (see, e.g., Figure 2). In order to treat more complex and, 
unfortunately, more common scenarios, non-linear extensions of these algorithms 
are required m- Such extensions are typically at the expense of much more 
computationally intensive algorithms. 

In this paper we propose the Maximal Rejection Classifier (MRC) that over- 
comes the above two drawbacks. While maintaining the simplicity of a linear 
classifier, it can also deal with non linearly separable cases. The only require- 
ment is that the Clutter class and the convex hull of the Target class are disjoint. 
We define this property as two convexly-separable classes, which is a much weaker 
condition compared to linear-separability. In addition, this classifier exploits the 
property of high Clutter probability. Hence, it attempts to give very fast Clutter 
labeling, even if at the expense of slow Target labeling. Thus, the entire input 
signal is classified very fast. 

The MRC is an iterative rejection based classification algorithm. The main 
idea is to apply at each iteration a linear projection followed by a thresholding, 
similar to the SVM and the FLD. However, as opposed to these two methods, the 
projection vector and the corresponding thresholds are chosen such that at each 
iteration the MRC attempts to maximize the number of rejected Clutter samples. 
This means that following the first classification iteration, many of the Clutter 
samples are already classified as such, and discarded from further consideration. 
The process is continued with the remaining Clutter samples, again searching for 
a linear projection vector and thresholds that maximize the rejection of Clutter 
samples from the remaining set. This process is repeated iteratively until 
few or none of the Clutter samples remain. The remaining samples at the final 
stage are considered as Targets. The idea of rejection-based classifier was already 
introduced by |^. However, in this work we extend the idea by using the concept 
of maximal rejection. 



2 The MRC in Theory 



Assume two classes are given in $\mathbb{R}^n$, X (the Target class) and Y (the Clutter 
class). It is required to discriminate between these two classes, i.e., given an 
input z drawn from one of these classes, we would like to be able to label it 
correctly as either Target or Clutter. One important point, however, is that we 
know a-priori that for a typical input stream the vast majority of the inputs are 
Clutter, i.e.: 

$$P\{X\} \ll P\{Y\} \tag{1}$$

where $P\{X\}$ is the a-priori probability that an input signal will be a Target, and 
$P\{Y\}$ is defined similarly. Based on this knowledge, we would like the classifier 
to give a decision as fast as possible (i.e., with as few operations as possible). 
Thus, Clutter labeling should be performed fast, even if at the expense of slow 
Target labeling. 

Similar to other linear classifiers m, we suggest to first project the sample 
z onto a vector u, and label it based on the projected value $\alpha = u^T z$. Projecting 
the Target class and the Clutter class onto u results in the Probability Density 
Functions (PDFs) $P\{\alpha|X\}$ and $P\{\alpha|Y\}$, respectively. We define the following 
intervals based on $P\{\alpha|X\}$ and $P\{\alpha|Y\}$: 

$$\begin{aligned}
C_t &= \{\alpha \mid P\{\alpha|X\} > 0,\ P\{\alpha|Y\} = 0\} \\
C_c &= \{\alpha \mid P\{\alpha|X\} = 0,\ P\{\alpha|Y\} > 0\} \\
C_u &= \{\alpha \mid P\{\alpha|X\} > 0,\ P\{\alpha|Y\} > 0\}
\end{aligned} \tag{2}$$

(t-Target, c-Clutter and u-Unknown). After projection, z is labeled either as a 
Target, Clutter, or Unknown, based on the interval to which $\alpha$ belongs. 

Unknown classifications are obtained only in the $C_u$ interval, where a 
decision cannot be made. Figure 1 presents an example for the construction of the 
intervals $C_t$, $C_c$ and $C_u$ and their appropriate decisions. The probability of the 
Unknown decision is given by: 

$$P\{\text{Unknown}\} = \int_{\alpha \in C_u} P\{Y\}\, P\{\alpha|Y\}\, d\alpha + \int_{\alpha \in C_u} P\{X\}\, P\{\alpha|X\}\, d\alpha \tag{3}$$

The above term is a function of the projection vector u. We would like to find 
the vector u which minimizes the “Unknown” probability. However, since this is 
a complex minimization problem, an alternative minimization is developed here, 
using a proximity measure between the two PDFs. 

If $P\{\alpha|Y\}$ and $P\{\alpha|X\}$ are far apart and separated from each other, 
$P\{\text{Unknown}\}$ will be small. Therefore, an alternative requirement is to 
minimize the overlap between these two PDFs. We will define this requirement 
using the following expected distance between a point $\alpha_0$ and a distribution 
$P\{\alpha\}$: 

$$D(\alpha_0 \,\|\, P\{\alpha\}) = \int_{\alpha} \frac{(\alpha_0 - \alpha)^2}{\sigma^2}\, P\{\alpha\}\, d\alpha = \frac{(\alpha_0 - \mu)^2 + \sigma^2}{\sigma^2}$$

where $\mu$ is the mean of $P\{\alpha\}$ and $\sigma^2$ is the variance of $P\{\alpha\}$. The division by 
$\sigma^2$ is performed in order to make the distance scale-invariant (or unit-invariant). 
Using this distance definition, the distance of $P\{\alpha|X\}$ from $P\{\alpha|Y\}$ can be 
defined as the expected distance of $P\{\alpha|Y\}$ from $P\{\alpha|X\}$: 
Fig. 1. The intervals $C_t$, $C_c$ and $C_u$, for specific PDFs $P\{\alpha|X\}$ and $P\{\alpha|Y\}$. 



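$$\begin{aligned}
D(P\{\alpha|Y\} \,\|\, P\{\alpha|X\}) &= \int_{\alpha'} D(\alpha' \,\|\, P\{\alpha|X\})\, P\{\alpha'|Y\}\, d\alpha' \\
&= \int_{\alpha'} \frac{(\alpha' - \mu_x)^2 + \sigma_x^2}{\sigma_x^2}\, P\{\alpha'|Y\}\, d\alpha' \\
&= \frac{\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2}{\sigma_x^2}
\end{aligned} \tag{4}$$

where $[\mu_x, \sigma_x^2]$ and $[\mu_y, \sigma_y^2]$ are the mean-variance pairs of $P\{\alpha|X\}$ and $P\{\alpha|Y\}$, 
respectively. Since we want the two distributions to have as small an overlap 
as possible, we would like to maximize this distance or minimize the proximity 
between $P\{\alpha|Y\}$ and $P\{\alpha|X\}$, which can be defined as the inverse of their 
mutual distance. Note that this measure is asymmetric with respect to the 
two distributions, i.e. the proximity defines the closeness of $P\{\alpha|Y\}$ to $P\{\alpha|X\}$, 
but not vice versa. Therefore, we define the overall proximity between the two 
distributions as follows: 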



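$$\text{Prox}(P\{\alpha|Y\}, P\{\alpha|X\}) = P\{X\}\, \frac{\sigma_y^2}{\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2} + P\{Y\}\, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2} \tag{5}$$

Compared to the original expression in Equation 3, the minimization of this 
term with respect to u is easier. If $P\{X\} = P\{Y\}$, i.e. if there is an even chance 
to obtain Target or Clutter inputs, the proximity becomes: 

$$\text{Prox}(P\{\alpha|Y\}, P\{\alpha|X\}) = \frac{1}{2} \cdot \frac{\sigma_x^2 + \sigma_y^2}{\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2} \tag{6}$$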






M. Elad, Y. Hel-Or, and R. Keshet 



which is associated with the cost function minimized by the Fisher Linear 
Discriminant (FLD)P. In our case $P\{X\} \ll P\{Y\}$ (Equation 1), thus the first 
term in Equation 5 is negligible and can be omitted. Therefore, the optimal u 
should minimize the resulting term: 

$$d(u) = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2} \tag{7}$$

where $\sigma_x^2$, $\sigma_y^2$, $\mu_y$ and $\mu_x$ are all functions of the projection vector u. 
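
The closed form behind Equations 4 to 7 rests on one step that is easy to miss: the second moment of the projected Clutter values around the Target mean splits into the Clutter variance plus the squared distance between the means. A short derivation of that step, written as a sketch using only the symbols already defined ($\mu_x, \sigma_x^2$ and $\mu_y, \sigma_y^2$ are the means and variances of the two projected PDFs):

```latex
\begin{align*}
\int (\alpha' - \mu_x)^2 \, P\{\alpha'|Y\} \, d\alpha'
  &= \int \bigl[(\alpha' - \mu_y) + (\mu_y - \mu_x)\bigr]^2 P\{\alpha'|Y\} \, d\alpha' \\
  &= \sigma_y^2
     + 2(\mu_y - \mu_x) \underbrace{\int (\alpha' - \mu_y) \, P\{\alpha'|Y\} \, d\alpha'}_{=\,0}
     + (\mu_y - \mu_x)^2 \\
  &= \sigma_y^2 + (\mu_y - \mu_x)^2 .
\end{align*}
```

Adding the constant term $\sigma_x^2$ and dividing by $\sigma_x^2$ gives the distance $(\sigma_x^2 + \sigma_y^2 + (\mu_y - \mu_x)^2)/\sigma_x^2$, and its inverse is the term $d(u)$ to be minimized.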

There are two factors that control $d(u)$. The first factor is the distance 
between the two means $\mu_y$ and $\mu_x$. Maximizing this distance will minimize $d(u)$. 
However, this factor is negligible when the two means are close to each other. 
This scenario is typical in detection cases when the target class is surrounded 
by the clutter class (see Figure 2). The other factor is the ratio between $\sigma_x$ and 
$\sigma_y$. Our aim is to find a projection direction which results in a small $\sigma_x$ and 
a large $\sigma_y$. This means that the projections of Target inputs tend to concentrate in 
a narrow interval, whereas the Clutter inputs will spread with a large variance 
(see e.g. Figure 2). 

For the optimal u, most of the Clutter inputs will be projected onto $C_c$, 
while $C_t$ might even be an empty set. Subsequently, after projection, many of 
the Clutter inputs are usually classified, whereas Target labeling may not be 
immediately possible. This serves our purpose because there is a high probability 
that a decision will be made when a Clutter input is given. Since these inputs are 
more frequent, this means a faster decision for the vast majority of the inputs. 

The method which we suggest follows this scheme: The classifier works in 
an iterative manner, projecting and thresholding with different parameters at 
each iteration sequentially. Since the classifier is asymmetric, the classification 
is based on rejections: Clutter inputs are classified and removed from further 
consideration while the remaining inputs are kept as suspected Targets. The 
iterations and the rejection approaches are both key concepts of the proposed 
scheme. 



3 The MRC in Practice 



Let us return to Equation 7 and find the optimal projection vector u. In order 
to do so, we have to express $\sigma_x^2$, $\sigma_y^2$, $\mu_y$ and $\mu_x$ as functions of u. It is easy to see 
that: 

$$\mu_x = u^T M_x \quad \text{and} \quad \sigma_x^2 = u^T R_{xx}\, u \tag{8}$$

where we define: 

$$M_x = \int_z z\, P\{z|X\}\, dz, \qquad R_{xx} = \int_z (z - M_x)(z - M_x)^T P\{z|X\}\, dz \tag{9}$$

In a similar manner we express $\mu_y$ and $\sigma_y^2$. As can be seen, only the first and 
second moments of the classes play a role in the choice of the projection vector 
u. 

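In practice we usually do not have the probabilities $P\{z|X\}$, $P\{z|Y\}$, and 
inference on the Target or Clutter class is achieved through examples. For the 
two example-sets $X = \{x_k\}_{k=1}^{L_x}$ and $Y = \{y_k\}_{k=1}^{L_y}$, the mean-covariance pairs 
($M_x$, $R_{xx}$, $M_y$ and $R_{yy}$) are replaced with empirical approximations: 

$$\hat{M}_x = \frac{1}{L_x} \sum_{k=1}^{L_x} x_k, \qquad \hat{R}_{xx} = \frac{1}{L_x} \sum_{k=1}^{L_x} (x_k - \hat{M}_x)(x_k - \hat{M}_x)^T \tag{10}$$

and similarly for $\hat{M}_y$ and $\hat{R}_{yy}$. The function we aim to minimize is therefore: 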



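$$d(u) = \frac{u^T \hat{R}_{xx}\, u}{u^T \left[ \hat{R}_{xx} + \hat{R}_{yy} + (\hat{M}_y - \hat{M}_x)(\hat{M}_y - \hat{M}_x)^T \right] u} \tag{11}$$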



Similarly to PEEI, it is easy to show that u that minimizes the above expression 
satisfies: 

$$\hat{R}_{xx}\, u = \lambda \left[ \hat{R}_{xx} + \hat{R}_{yy} + (\hat{M}_y - \hat{M}_x)(\hat{M}_y - \hat{M}_x)^T \right] u \tag{12}$$

and should correspond to the smallest possible $\lambda$. A problem of the form $Au = 
\lambda Bu$, as in Equation 12, is known as the generalized eigenvalue problem [I I I4l5j, 
and has a closed form solution. Notice that given any solution u for this equation, 
$\beta u$ is also a solution with the same $\lambda$. Therefore, without loss of generality, the 
normalized solution $\|u\| = 1$ is used. 
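
To make the training step concrete, the following NumPy sketch computes one projection vector from two example sets by solving the generalized eigenvalue problem of Equation 12 for its smallest eigenvalue. It is an illustration under assumptions, not the authors' code: the function name `mrc_stage` is invented here, the matrix on the right-hand side is assumed to be positive definite (so that a Cholesky factorization exists), and the returned thresholds are simply the extreme projected Target values.

```python
import numpy as np


def mrc_stage(X, Y):
    """One MRC training stage (sketch).

    X: (Lx, n) array of Target examples, Y: (Ly, n) array of Clutter
    examples. Returns a unit vector u minimizing d(u) together with
    the two thresholds bounding the projected Target examples.
    """
    Mx, My = X.mean(axis=0), Y.mean(axis=0)
    Rxx = np.cov(X, rowvar=False, bias=True)  # 1/L normalization
    Ryy = np.cov(Y, rowvar=False, bias=True)
    diff = (My - Mx)[:, None]
    B = Rxx + Ryy + diff @ diff.T

    # Reduce Rxx u = lambda B u to a symmetric problem via B = L L^T.
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    C = Linv @ Rxx @ Linv.T
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    u = Linv.T @ eigvecs[:, 0]            # smallest generalized eigenvalue
    u = u / np.linalg.norm(u)

    proj = X @ u
    return u, proj.min(), proj.max()
```

Calling `mrc_stage` again on the examples that survive the thresholds yields the next projection vector, which is how the cascade described in the rest of this section would be built.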

After finding the optimal projection vector u, the intervals $C_t$, $C_c$, and $C_u$ 
can be determined according to Equation 2. An input z is labeled as a Target 
or Clutter if its projected value $u^T z$ is in $C_t$ or $C_c$, respectively. Figure 2 (left) 
presents this stage for the case where $C_t$ is empty, i.e. there are no inputs which 
can be classified as Target. 

Input vectors whose projected values are in $C_u$ are not labeled. For these 
inputs we apply another step of classification, where the design of the optimal 
projection vector in this step is performed according to the following new 
distributions: 

$$P\{z \mid Y \ \&\ u_1^T z \in C_u\} \quad \text{and} \quad P\{z \mid X \ \&\ u_1^T z \in C_u\}$$

We define the next projection vector $u_2$ as the vector which minimizes the 
proximity measure between the above two distributions. This minimization is 
performed in the same manner as described for the first step. Figure 2 (right) presents 
the second rejection stage, which follows the first stage shown in Figure 2 (left). 

Following the second step, the process continues similarly with projection 
vectors $u_3, u_4, \ldots$, etc. Due to the optimality of the projection vector at each 
step, it is expected that a large portion of the input vectors will be labeled as 
Clutter at each step, while the following steps will deal with the remaining input 
vectors. Applying the cascade of classifiers in such an iterative manner ensures 
good classification performance with respect to both accurate labeling and 
a fast classification rate. 

Fig. 2. Left: The first rejection stage for a 2D example. Right: The second rejection 
stage. 

Since we exchanged the class probabilities with sets of points, it is 
impractical to define the intervals $C_t$, $C_c$, and $C_u$ using Equation 2. This is because 
the intervals will be composed of many fragments, each of which results from a 
particular example. Moreover, the domain of $\alpha$ cannot be covered by a finite set 
of examples. Therefore, it is more natural to define, for each set, two thresholds 
bounding its projection values. As explained above, due to the functional that 
we are minimizing, in typical detection cases the Target thresholds define a small 
interval located inside the Clutter interval (see Figure 2). Therefore, for 
simplicity, we define, for each projection vector, only a single interval $T = [T_1, T_2]$, 
which is the interval bounding the Target set. After projection we classify points 
projected outside T as Clutter and points projected inside T as Unknown. 
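
At detection time the cascade reduces to a loop of dot products and interval tests that stops at the first rejection. A minimal pure-Python sketch, assuming the stages are given as a list of hypothetical `(u, t1, t2)` triples produced by training (the function name and the returned tuple format are choices made for this example):

```python
def mrc_classify(z, stages):
    """Run an input vector through the rejection cascade.

    stages is a list of (u, t1, t2): a projection vector and the two
    thresholds bounding the Target interval T. Returns the label and
    the number of projections that were actually computed.
    """
    for k, (u, t1, t2) in enumerate(stages, start=1):
        alpha = sum(ui * zi for ui, zi in zip(u, z))  # alpha = u^T z
        if alpha < t1 or alpha > t2:
            return "Clutter", k  # rejected: no further stages needed
    # Still inside T after every stage: keep as a suspected Target.
    return "Target", len(stages)
```

Since most inputs are Clutter and most Clutter is rejected in the first stages, the average number of projections per input stays small even when the cascade itself is long.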

In the case where the Target class forms a convex set, and the two classes are 
disjoint, it is theoretically possible to completely discriminate between them. 
This property is easily shown by noticing that we are actually extracting the 
Target set from the Clutter set by a sequence of two parallel hyper-planes, cor- 
responding to the two thresholding operations. This constructs a parallelogram 
that bounds the Target set from outside. Since any convex set can be constructed 
by a set of parallel hyper-planes, exact classification is possible. However, if the 
Target set is non-convex, or the two classes are non-convexly separable (as de- 
fined in the Introduction), it is impossible to achieve a classification with zero 
errors; Clutter inputs which are inside the convex hull of the Target set cannot 
be rejected. Overcoming this limitation can be accomplished by a non-linear 
extension of the MRC, which is outside the scope of this paper. In practice, even if 
we deal with a convex Target set, false-alarms may exist due to the sub-optimal 
approach we are using, which neglects multi-dimensional moments higher than 
the second. However, simulations demonstrate that the number of false-alarms 
is typically small. 



4 Face Detection Using the MRC 



The face detection problem can be specified as the need to detect all instances 
of faces in a given image, at all spatial positions, all scales, all facial expressions, 
all poses, of all people, and under all lighting conditions. All these requirements 
should be met, while having few or no false alarms and mis-detections, and with 
as fast an algorithm as possible. This description reveals the complexity of the 
detection problem at hand. As opposed to other pattern detection problems, 
faces are expected to appear with considerable variations, even for the detection 
of frontal and vertical faces only. Variations are expected because of changes in 
skin color, facial hair, glasses, face shape, and more. 

Several papers already addressed the face detection problem using various 
methods, such as SVM I2EI, Neural Networks , and other methods mam 
I I 211 ;tlj . In all of these studies, the above complete list of requirements is relaxed 
in order to obtain practical detection algorithms. Following [ttilTIDII If 1 5] . we deal 
with the detection of frontal and vertical faces only. 

In all these algorithms, spatial position and scale are treated through the 
same method, in which the given image is decomposed into a Gaussian pyramid 
with near-unity (e.g., 1.2) resolution ratio. The search for faces is performed 
in each resolution layer independently, thus enabling the treatment of different 
scales. In order to be able to detect faces at all spatial positions, fixed sized 
blocks of pixels are extracted from the image at all positions (with full or partial 
overlap) for testing. In addition to the pyramid part, which treats varying scales 
and spatial positions, the core part of the detection algorithm is essentially a 
classifier which provides a Face/Non-Face decision for each input block. 
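
The search geometry described above (layers shrinking by a near-unity ratio, fixed-size blocks taken at every position) can be enumerated with a few lines of Python. This sketch covers only the bookkeeping of layer sizes and block positions; the Gaussian smoothing and resampling that produce the actual pyramid images are omitted. The ratio 1.2 is the example value from the text, the 15-pixel block side matches the face size used later in the paper, and the function names and the stride parameter are assumptions of this example.

```python
def pyramid_layers(width, height, ratio=1.2, block=15):
    """Sizes of the pyramid layers, from full resolution down to the
    last layer that still fits one block."""
    layers = []
    w, h = float(width), float(height)
    while int(w) >= block and int(h) >= block:
        layers.append((int(w), int(h)))
        w, h = w / ratio, h / ratio
    return layers


def block_positions(width, height, block=15, stride=1):
    """Top-left corners of all block-sized windows in one layer."""
    return [
        (x, y)
        for y in range(0, height - block + 1, stride)
        for x in range(0, width - block + 1, stride)
    ]
```

Every position returned by `block_positions`, in every layer, is one input to the Face/Non-Face classifier, which is why the per-block decision time dominates the cost of the whole detector.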

We demonstrate the application of the MRC for this task. In the face-detection 
application, Faces take the role of targets, and Non-Faces are the 
clutter. The MRC produces very fast Non-Face labeling at the expense of slow 
Face labeling. Thus, on average, it has a short decision time per input block. 

The first stage in the MRC is to gather two example-sets, Faces and 
Non-Faces. As mentioned earlier, large enough sets are needed in order to guarantee 
good generalization for the faces and the non-faces that may be encountered in 
images. As to the Face set, the ORL database (1) was used. This database contains 
400 frontal and vertical face images of 40 different individuals. By extracting the 
face portion from each of these images and scaling to 15 x 15 pixels, we obtained 
the set $X = \{x_k\}_{k=1}^{L_x}$ (with $L_x = 400$). The Non-Face set is required to be much 
larger, in order to represent the variability of Non-Face patterns in images. For 
this purpose we have collected more than 20 million Non-Face examples from 
images with no faces. 

(1) http://www.cam-orl.co.uk/facedatabase.html: ORL database web-site 

5 Results 

Fig. 4. Two examples for face detection with the MRC 

We trained the MRC for detecting faces by computing 50 sets of kernels 
and associated thresholds using the above described databases of 
Faces and Non-Faces. Figures 3 and 4 show three examples of the obtained 
results. In these examples, the first stage rejected close to 90% of the candidates. 
This stage is merely a convolution of the input image (at every scale) with the 
first kernel, followed by thresholding. Successive kernels yield further rejection 
at about 50% for each projection. Thus, the complete MRC classification 
required an effective number of close to two convolutions per pixel in each 
resolution layer. As can be seen from the examples, the MRC approach 
performed very well and was able to detect most of the existing faces. There are 
few false alarms, which typically correspond to blocks of pixels having a pattern 
which may resemble a face. In addition, mis-detection occurs when a face is 
partially occluded or rotated too much. Generally speaking, the algorithm performs 
very well in terms of detection rate, false alarm rate, and most important of 
all, computational complexity. Due to space limitations we do not include more 
technical details in this paper. Comprehensive description of the results as well 
as comparative study with other face detection algorithms can be found in m. 

6 Conclusion 

In this paper we presented a new classifier for target detection, which discrim- 
inates between Target and Clutter classes. The proposed classifier exploits the 
fact that the probability of a given input to belong to the Target class is much 
lower, as compared to its probability to belong to the Clutter class. This as- 
sumption, which is valid in many pattern detection applications, is exploited 
in designing an optimal classifier that detects Target signals as fast as possi- 
ble. Moreover, exact classification is possible when the Target and the Clutter 
classes are convexly separable. The Fisher Linear Discriminant (FLD) is a spe- 
cial case of the proposed framework when the Target and Clutter probabilities 
are equal. In addition, the proposed scheme overcomes the instabilities arising in 
the FLD in cases where the means of the two classes are close to each other. An 
improvement of the proposed technique is possible by rejecting Target patterns 
instead of Clutter patterns in advanced stages, when the probability of Clutter 
is not larger anymore. The performance of the MRC is demonstrated in the face 
detection problem. The obtained face detection algorithm is shown to be both 
computationally very efficient and accurate. Further details on the theory of the 
MRC and its application to face detection can be found in j 1 51 1 4) . 
Abstract. An experiment was conducted to explore the transfer of training 
between visual lobe measurement tasks and visual search tasks. The study 
demonstrated that lobe practice improved the visual lobe, which in turn resulted 
in improved visual search performance. The implication is that visual lobe 
practice on carefully chosen targets can provide an effective training strategy in 
visual search and inspection. Results obtained from this research will help us in 
devising superior strategies for a whole range of tasks that have a visual search 
component (e.g., industrial inspection, military target acquisition). Use of these 
strategies will ultimately lead to superior search performance. 



1 Introduction 

Visual search of extended fields has received continuing interest from researchers [1], 
[2], [3], [4]. It constitutes a component of several practically important tasks such as 
industrial inspection and military target acquisition [5], [6]. Because of its continuing 
importance in such tasks, training strategies are required which will rapidly and 
effectively improve search performance. The research presented here focuses on 
the most fundamental parameter in visual search performance, 
the visual lobe size. Visual search has been modeled as a series of fixations, during 
which visual information is extracted from an area around the central fixation point 
referred to as the visual lobe [7], [8]. The visual lobe describes the decrease in target 
detection probability with increasing eccentricity angle from the line of sight [9] . The 
importance of the visual lobe as a determinant of visual search performance has been 
illustrated by Johnston [10] who showed that the size of the visual lobe had a 
significant effect on search performance. Subjects who had larger visual lobes 
exhibited shorter search times. Such a relationship between visual lobe and search 
performance has been confirmed by Erickson [11], Leachtenauer [12] and Courtney 
and Shou [13]. The concept of visual lobe is central to all mathematical approaches 
to extended visual search [14], [15], [16]. For a given search area the search time is 
dependent on the following [5]: the lobe size, the fixation time and the search 
strategy. The visual lobe area is affected by such factors as the adaptation level of the 
eye, target characteristics such as target size and target embeddedness, background 
characteristics, motivation, individual differences in peripheral vision and individual 
experience [17]. 

Studies in visual search have shown that both speed and accuracy can be improved 
with controlled practice [18], [14]. Training has consistently been shown to be an effective 
way for improving various facets of visual inspection performance where the 
inspection task has a major search component [19], [20], [21], [22]. There is evidence 
to suggest that the visual lobe is amenable to training and best depicted by a 
negatively accelerated learning function [9], [12]. In a major field study of the role of 
photo-interpreter performance, Leachtenauer [12] found an increase in visual lobe 
size with training and a correlation between lobe size and interpreter search 
performance. 

Thus, drawing from the literature on visual lobe and visual search, the question that 
needs answering is whether it is possible to train subjects to improve their visual lobe. 
Moreover, does an improved visual lobe translate into improved search performance? 
To answer these questions, the following experiment was devised. 



2 Experiment - Effects of Lobe Practice on Search Performance 

2.1 Methodology 

The experiment had as its objective the test of a transfer of training hypothesis. The 
objective was to determine whether lobe practice improved search performance. A 
criterion task was presented in the Pre-Training phase. Following this, a Training 
phase of practice was given. Finally the criterion task was presented once again in the 
Post-Training phase. In the Training phase, separate groups were given different 
training interventions, with subjects randomly assigned to the groups. These training 
intervention groups are characterized as Groups (G), while the criterion tasks are 
referred to as Trials (T). 



2.2 Subjects 

All subjects were undergraduate and graduate student volunteers, aged 20-30 years, 
and were paid $5.00 per hour to participate. All were tested to ensure at least 20/20 
vision, when wearing their normal corrective prescriptions if appropriate. Subjects in 
each experiment were briefed and given access to the criterion and training tasks to 
familiarize them with the tasks and computer interfaces. The eighteen subjects were 
randomly assigned to two groups, a Lobe Practice Group and a Control Group. 

Lobe Practice Group. Practice on five trial blocks of the VL task, taking about 30 
min for a total of 700 screens. 

Search Practice Group. Practice on five trial blocks of the VS task, taking about 
30 min for a total of 250 screens. 



2.3 Equipment 

The task was performed on an IBM PS2 (Model 70). Stimulus material consisted of a 
search field of size 80 characters wide by 23 characters deep. The search field 
contained the following background characters: % ! | H- - ( ) \ which were randomly 
located on the screen, with a background character density of 10% (i.e., only 10% of 
the spaces on the search field were filled, the rest of the search field being filled by 
blanks). The character ‘&’ served as the target (Figure 1). The characters on the 
screen were viewed at 0.5 meters. The entire screen subtended an angle of 33.4° 
horizontally. The density of characters around the target was controlled to prevent any 
embedding effect, as in the task used by Czaja et al. [20] and Gallwey et al. [23]. 
Monk et al. [24] have shown that the number of characters in the eight positions 
adjacent to the target has a large effect on the performance in a search task. Hence this 
was kept constant at two out of eight positions for all screens. 
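
The stimulus construction described above can be sketched as follows. This is only an illustration under stated assumptions: the background character set `BACKGROUND` is a guess (the printed list is partly illegible), and the function names `make_screen` and `filled_neighbours` are hypothetical. The sketch fills about 10% of the cells with clutter and then places exactly two background characters in the eight positions adjacent to the target.

```python
import random

WIDTH, HEIGHT = 80, 23        # search field size given in the text
DENSITY = 0.10                # 10% of cells hold a background character
BACKGROUND = "%!|+-()\\"      # assumed set; the printed list is partly illegible
TARGET = "&"


def make_screen(target_row, target_col, rng):
    """Build one screen as a list of HEIGHT strings, each WIDTH characters long.

    The target must not sit on the border, so that all eight neighbours exist.
    """
    if not (1 <= target_row <= HEIGHT - 2 and 1 <= target_col <= WIDTH - 2):
        raise ValueError("target must have eight neighbours")
    grid = [[" "] * WIDTH for _ in range(HEIGHT)]
    neighbours = [(target_row + dr, target_col + dc)
                  for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                  if (dr, dc) != (0, 0)]
    reserved = set(neighbours) | {(target_row, target_col)}
    # Random clutter everywhere except the target and its neighbourhood.
    for r in range(HEIGHT):
        for c in range(WIDTH):
            if (r, c) not in reserved and rng.random() < DENSITY:
                grid[r][c] = rng.choice(BACKGROUND)
    # Exactly two of the eight adjacent positions are filled.
    for r, c in rng.sample(neighbours, 2):
        grid[r][c] = rng.choice(BACKGROUND)
    grid[target_row][target_col] = TARGET
    return ["".join(row) for row in grid]


def filled_neighbours(screen, target_row, target_col):
    """Count non-blank characters in the eight positions around the target."""
    return sum(screen[target_row + dr][target_col + dc] != " "
               for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0))


screen = make_screen(11, 40, random.Random(0))
```

Passing a seeded `random.Random` makes a screen reproducible, which is convenient when the same screens must be shown to every subject.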

Fig. 1. Screen for experiment 



2.4 Description of the VS and VL Tasks 

The objective of the VL task was to estimate the horizontal extent of the visual lobe 
by determining how far into the periphery the subjects could see the target in a single 
eye fixation. In the VL task, a fixation cross was presented for 3 s, followed by a VL test 
screen for 0.3 s, and finally a repeat of the cross to prevent post-stimulus scanning. 
The purpose of the cross was to help the subjects follow their instructions to fixate at 
the central fixation point after each viewing of the VL screen. The target would 
appear randomly at any one of the six equally spaced, predetermined locations on the 
horizontal center line, with three positions on either side of the central fixation point. 
The subject’s task was to locate the single target in the VL test screen, which could only 
occur along the horizontal axis of the fixation cross. With such a brief duration of 
exposure, only a single fixation would be possible. Subjects using binocular vision 
indicated that they had found a target to the left of the fixation cross by pressing the 
key "Q" on the keyboard or the "P" key if the target was found on the right. If no 
target could be found, the space bar was pressed. No prior information concerning the 
position of the targets at the six predetermined positions was provided to the subjects 
before exposure to the VL screen. For an equal number of screens of each VL 
experiment, the target was not present at any of the six target positions to discourage 
any guessing strategy. 

In the visual search (VS) tasks, the subject’s objective was to locate a target by 
self-paced search of the whole screen. The background characters and density were the same 
as for the VL task. The entire screen was divided into four quadrants by the cross. 
When each VS screen appeared, timing began using the computer’s timing function; 
timing ended when the subject pressed one of four keys (L, A, X or M) appropriate to 
the quadrant in which the target was found. 



2.5 Stimulus Materials 

The stimulus materials used in this experiment comprised the VL and VS tasks. 
Each Trial block of the VL task comprised 360 screens with targets and 60 
screens without targets. The criterion tasks were the VS task, with a single Trial 
block of 150 screens, and a single Trial block of the VL task, both administered before and 
after training. The training intervention comprised five trial blocks of the VL task. 



2.6 Measurement and Analysis 

The measurement in the VL tasks consisted of determining the probabilities of 
detection at each target distance (in degrees of visual angle) from the fixation cross. 
Left and right occurrences of the target were combined. On each trial block for each 
subject, the probabilities of detection were used to derive a single measure of visual 
lobe size. This was defined as the visual lobe horizontal radius. The probability of 
detection is plotted as a function of angular position of the target for two trial blocks 
to give one half of a visual lobe horizontal cross section. As the edges of the visual 
lobe have been found to take the form of a cumulative normal distribution [25], [26], 
the z values corresponding to each probability were regressed onto angular target 
position. The angular position giving a predicted z value of 0.00, corresponding to a 
detection probability of 0.50, was used as the measure of visual lobe size. In the visual 
search tasks, search time, in seconds per screen, was measured. 
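
The lobe-size measure just described can be illustrated with a short sketch. It is not the authors' analysis code: the function name `lobe_radius`, the clipping constant `eps`, and the example data are assumptions made for illustration. Detection probabilities are converted to z values with the inverse normal distribution, a least-squares line is fitted to z against target angle, and the angle at which the fitted z equals 0.00 (probability 0.50) is returned.

```python
from statistics import NormalDist


def lobe_radius(angles_deg, p_detect, eps=0.01):
    """Return the angle (degrees) at which the fitted detection probability is 0.50."""
    nd = NormalDist()
    # inv_cdf is undefined at exactly 0 or 1, so probabilities are clipped
    # (eps is an assumed choice, not taken from the paper).
    z = [nd.inv_cdf(min(max(p, eps), 1.0 - eps)) for p in p_detect]
    n = len(angles_deg)
    mean_x = sum(angles_deg) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in angles_deg)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(angles_deg, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    # Solve slope * angle + intercept = 0 for the angle.
    return -intercept / slope


# Hypothetical data: three eccentricities, with probabilities generated from a
# cumulative normal lobe edge centred on 10 degrees.
angles = [5.0, 10.0, 15.0]
probs = [1.0 - NormalDist(mu=10.0, sigma=4.0).cdf(x) for x in angles]
radius = lobe_radius(angles, probs)
```

Because left and right target positions are combined, only three distinct eccentricities enter the regression for each trial block, so the fitted radius is sensitive to any single probability estimate.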



3 Results 

3.1 Effects of Training Intervention on the Criterion Tasks 

With both VL and VS as criterion tasks, two analyses were undertaken. Visual lobes 
were calculated as before, and a mixed model ANOVA conducted on lobe sizes. 
There was no Group effect but a large Trial effect (F(1,16) = 711.7, p < 0.0001) and a 
large Group x Trial interaction (F(1,16) = 10.6, p < 0.01). An ANOVA conducted on 
the Pre-training Trial showed no group effect. For the Control group, the visual lobe 
horizontal radius increased after the training intervention from 10.1° to 11.52° while 
for the lobe practice group the change was from 9.13° to 13.06°. A mixed model 
ANOVA of the mean search times for each subject showed significant Trial (F(1,16) 
= 166.1, p < 0.001) and Group x Trial interaction (F(1,16) = 7.57, p < 0.001) 
effects. There was no difference between the groups on the Pre-training Trial. Mean 
search times decreased for the control group from 6.4 to 5.3 seconds, and for the lobe 
practice group from 6.4 to 4.5 seconds. Clearly, there was a good transfer from a lobe 
practice task to a search task as evidenced from the correlation between lobe size and 
visual search performance. 



3.2 Effects During Practice 

For the five practice trial blocks, a mixed model ANOVA on the lobe sizes showed a 
significant effect of Trials (F(4,32) = 8.310, p < 0.001) for the lobe practice group. 
Figure 2 plots this data, with the criterion lobe size trials also included. 



Fig. 2. Learning graph for lobe size (vertical axis: visual lobe radius, in degrees; horizontal axis: total screens) 



3.3 Search Time/Lobe Size Correlation 

Both the correlations before practice (r = 0.725) and after practice (r = 0.888) were 
significant at p < 0.001. 

4 Discussion and Conclusion 

The experiment has reinforced the intimate relationship between visual lobe size and 
visual search performance, and extended the relationship to cover practice and 
transfer effects. The experiment demonstrated the transfer effect: lobe practice did 
indeed improve performance on a criterion search task. Additionally, the effect of 
lobe practice on lobe size was confirmed. The lobe practice effects on lobe size, found 
by Leachtenauer [12] for photo interpreters, were confirmed here by the experiment, 
and extended to the effects of search practice on lobe size. However, the slope of the 
learning function leveled off. This suggests that considerable practice is required to 
realize the full benefits of learning, so that directly improving a searcher’s visual lobe 
size by practice at a lobe measurement task can be a practical training strategy. Given 
that lobe size and visual search performance were found to be highly correlated in 
these experiments, and that lobe practice transfers to search performance, this should 
be a powerful means of training in a difficult visual task such as inspection. 
Leachtenauer [12] had previously found an insignificant training effect on lobe size 
but a significant effect on search performance, using a training program aimed at a 
technique for expansion of the visual lobe. From our data, it appears that simple 
practice on the lobe measurement task is an effective training strategy. 

The experiment confirmed that lobe task practice does improve performance on the 
criterion visual search task, showing this to be a viable and simple method for 
improving visual search performance. With the ability to construct computer-based 
simulators (such as the one used here) comes the ability to use digitally scanned 
images to develop a large library of faults, and incorporate these directly into a 
training program. Training can thus achieve a higher degree of realism and generality 
than was used here using available technology. Thus, many of the benefits of using 
computers in the knowledge-based and rule-based aspects of the aircraft maintenance 
function (e.g., Johnson [27]) can be brought to the vital but more skill-based 
functioning of visual inspection. From the results of this study the following 
conclusion can be drawn: Practice on a visual lobe detection task does transfer to a 
visual search task. 

Abstract. We consider the problem of recognizing an object from its 
silhouette. We focus on the case in which the camera translates, and 
rotates about a known axis parallel to the image, such as when a mo- 
bile robot explores an environment. In this case we present an algorithm 
for determining whether a new silhouette could come from the same ob- 
ject that produced two previously seen silhouettes. In a basic case, when 
cross-sections of each silhouette are single line segments, we can check 
for consistency between three silhouettes using linear programming. This 
provides the basis for methods that handle more complex cases. We show 
many experiments that demonstrate the performance of these methods 
when there is noise, some deviation from the assumptions of the algo- 
rithms, and partial occlusion. Previous work has addressed the problem 
of precisely reconstructing an object using many silhouettes taken under 
controlled conditions. Our work shows that recognition can be performed 
without complete reconstruction, so that a small number of images can 
be used, with viewpoints that are only partly constrained. 



1 Introduction 

This paper shows how to tell whether a new silhouette could come from the 
same object as previously seen ones. We consider the case in which an object 
rotates about a single, known axis parallel to the viewing plane, and is viewed 
with scaled orthographic projection. This is an interesting subcase of general 
viewing conditions. It is what happens when a person or robot stands upright 
as it explores a scene, so that the eye or camera is directed parallel to the floor. 
It is also the case when an object rests on a rotating turntable, with the camera 
axis parallel to the turntable. 

It is easy to show that given any two silhouettes, and any two viewpoints, 
there is always an object that could have produced both silhouettes. So we sup- 
pose that two silhouettes of an object have been obtained to model it, and ask 
whether a third silhouette is consistent with these two images. We first charac- 
terize the constraint that two silhouettes place on an object’s shape, and show 
that even when the amount of rotation between the silhouettes is unknown, this 
constraint can be determined up to an affine transformation. Next we show that 
for silhouettes in which every horizontal cross-section is one line segment, the 
question of whether a new silhouette is consistent with these two can be reduced 
to a linear program. Linear programming can also be used to test a necessary, 
but not sufficient condition for arbitrary silhouettes. We provide additional al- 
gorithms for silhouettes with cross-sections consisting of multiple line segments. 
We describe a number of experiments with these algorithms. 

Much prior work has focused on using silhouettes to determine the 3D struc- 
ture of an object. Some work uses a single silhouette. Strong prior assumptions 
are needed to make reconstruction possible in this case (eg., HHEi]). A second 
approach is to collect a large number of silhouettes, from known viewpoints, and 
use them to reconstruct a 3D object using differential methods (eg., u, U2I, 
nn ) or volume intersection (eg., m, m). These methods can produce accurate 
approximations to 3D shape, although interestingly, Laurentini [7] shows that 
exact reconstruction of even very simple polyhedra may require an unbounded 
number of images. Our current work makes quite different assumptions. We con- 
sider using a small number of silhouettes obtained from unknown viewpoints 
and ask whether the set of prior images and the new image are consistent with 
a single 3D shape without reconstructing a specific shape. 

2 Constraints from Two Silhouettes 

Let p,q,r denote the boundaries of three silhouettes. Let P,Q,R denote the 
filled regions of the silhouettes. When rotation is about the y axis, there will 
be two 3D points that appear in every silhouette, the points with highest and 
lowest y values. Denote the images of these points on the three silhouettes as: 
$p_1, p_2$, $q_1, q_2$, $r_1, r_2$. Let M denote the actual 3D object. 





Fig. 1. Left: Two silhouettes with bottom points aligned. Middle: The y = i plane. 
Right: Rectangular constraints project to a new image. 



Given two silhouettes, p and q, we can always construct an object that can 
produce p and q, with a method based on volume intersection (eg., P]). We 
may assume, without loss of generality (WLOG), that rotation is about the 
y-axis (when we consider three silhouettes this assumption results in a loss of 
generality). Also, WLOG assume that the silhouettes are transformed in the 
plane so that $p_1 = q_1 = (0, 0)$ (see Figure 1, left), and the tangent to p and q at 
that point is the x axis. Assume the silhouettes are scaled so that $p_2$ and $q_2$ have 
the same y value. Assume also WLOG that M is positioned so that the point on 
M that projects to $p_1$ is placed at (0,0,0). Moreover, we can assume that M is 
projected without scaling or translation. That is, in this setup, we can assume 
the object is projected orthographically to produce p, then rotated about the y 
axis, and projected orthographically to produce q. 

If we cut through P and Q with a single horizontal line, y = i, we get two line 
segments, called $P_i$ and $Q_i$. Denote the end points of $P_i$ by $(P_{i,\min}, i)$, $(P_{i,\max}, i)$. 
These line segments are projections of a slice through M, where it intersects the 
plane y = i. Call this slice $M_i$. The segment $P_i$ constrains $M_i$. In particular, in 
addition to lying in the y = i plane, all points in $M_i$ must have 
$P_{i,\min} \le x \le P_{i,\max}$, and there must be points on $M_i$ for which $x = P_{i,\min}$ and for which 
$x = P_{i,\max}$ (and there must be points on $M_i$ that take on every intermediate 
value of x). We get constraints of this form for every i. Any model that meets 
these constraints will produce a silhouette p. 

Now, suppose that Q has been produced after rotating M by some angle, $\theta$ 
(see Figure 1, middle). The constraints that Q places on M have the same form. 
In particular, $P_i$ and $Q_i$ provide the only information that constrains $M_i$. However, 
the constraints $Q_i$ places are rotated by an angle $\theta$ relative to the constraints 
of $P_i$. Therefore, together, they constrain $M_i$ to lie inside a parallelogram, and 
to touch all of its sides. Therefore, we can create an object that produces both 
silhouettes simply by constructing these parallelograms, then constructing an 
object that satisfies the constraints they produce. 
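
The construction of one constraining parallelogram can be sketched in a few lines. This is an illustration under an assumed sign convention, not the authors' code: the first view is taken to bound the x coordinate of the slice, and the second view is taken to bound the coordinate $-x\cos\theta + z\sin\theta$, so that $\theta = \pi/2$ gives a rectangle in which the second view bounds z. The names `slice_parallelogram` and `second_view_coordinate` are hypothetical.

```python
import math


def slice_parallelogram(p_min, p_max, q_min, q_max, theta):
    """Vertices (x, z) of the parallelogram constraining one slice M_i.

    Assumed convention: the first view bounds x by [p_min, p_max]; the second
    view bounds -x*cos(theta) + z*sin(theta) by [q_min, q_max].
    """
    s, c = math.sin(theta), math.cos(theta)
    if abs(s) < 1e-12:
        raise ValueError("theta must not be a multiple of pi")
    # Each vertex lies on one bounding line from each view; solve for z.
    return [(x, (q + x * c) / s)
            for q in (q_min, q_max) for x in (p_min, p_max)]


def second_view_coordinate(x, z, theta):
    """Coordinate of the point (x, z) along the second view's image axis."""
    return -x * math.cos(theta) + z * math.sin(theta)


vertices = slice_parallelogram(0.0, 2.0, -1.0, 1.0, theta=1.0)
```

Any slice that stays inside this parallelogram and touches all four of its sides reproduces both cross-sections $P_i$ and $Q_i$, which is why an object consistent with two silhouettes always exists.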

We denote the entire set of constraints that we get from these two images 
by $C_\theta$. We now prove that it is not important to know $\theta$, because the sets of 
constraints that we derive by assuming different values of $\theta$ are closely related. 
Let $C_{\pi/2}$ denote the constraints we get by assuming that $\theta = \pi/2$. Then 

Lemma 1. $T_\theta C_\theta = C_{\pi/2}$, where: 

$$T_\theta = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -\cos\theta & 0 & \sin\theta \end{pmatrix}$$

That is, $C_\theta$ consists of a set of parallelograms, and applying $T_\theta$ to these produces 
the set of rectangles that make up $C_{\pi/2}$. 

We omit the proof of this lemma for lack of space. This shows that $C_\theta$ and 
$C_{\pi/2}$ are related by an affine transformation. Since the affine transformations form 
a group, this implies that without knowing $\theta$ we determine the constraints up 
to an affine transformation. This is related to prior results showing that affine 
structure of point sets can be determined from two images (0). However, our 
results are quite different, since they refer to silhouettes in which different sets 
of points generate the silhouette in each image, and our proof is quite different. 

3 Comparing Three Silhouettes 



3.1 Silhouettes with Simple Cross-Sections 



Now we assume that the third silhouette r is generated by again rotating M 
about the same axis. This problem is easier than the case of general rotations, 
because each of the parallelograms constraining M projects to a line segment. For 
any other direction of rotation, they project to parallelograms. To simplify 
notation we refer to $C_{\pi/2}$ as C. These constraints are directly determined by assuming 
that p constrains the x coordinates of M, and q constrains the z coordinates. 
The true constraints, $C_\theta$, depend on the true angle of rotation between the first 
two images, which is not known. 

We can translate and scale r so that the y coordinates of the tops and bottoms 
of the silhouettes are aligned. This accounts for all scaling, and all translation 
in y. We may then write the transformation that generates r in the form: 



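$$T_\phi M = \begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix} M + \begin{pmatrix} t \\ 0 \\ 0 \end{pmatrix}$$

which expresses x translation of the object, rotation about the y axis by $\phi$, 
and then orthographic projection. We now examine how this transformation 
projects the vertices of the constraining parallelograms into the image. As we 
will see, the locations of these projected vertices constrain the new silhouette 
r. Since $C = T_\theta C_\theta$, we have $T_\theta^{-1} C = C_\theta$. Therefore, the projection of the true 
constraints is $T_\phi C_\theta = T_\phi T_\theta^{-1} C$, or, for the image x coordinate: 

$$T_\phi C_\theta = \begin{pmatrix} \cos\phi + \dfrac{\sin\phi \cos\theta}{\sin\theta} & 0 & \dfrac{\sin\phi}{\sin\theta} \end{pmatrix} C + t$$

We will abbreviate this as: $T_\phi C_\theta = (a, 0, b)\, C + t$. 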

a, b and t are unknowns, while the constraints, C and the new silhouette r 
are known. Our goal is to see if a, b and t exist that match the constraints and 
silhouette consistently. 



Constraints on transformation parameters. We will be assisted by the 
fact that the projection of the constraints can be described by equations that 
are linear in a, b and t. However, because a and b are derived from trigonometric 
functions they cannot assume arbitrary values. So we first formulate constraints 
on these possible values. 

We can show (derivation omitted) that a and b are constrained by: 

$$-1 \le |a| - |b| \le 1, \qquad 1 \le |a| + |b|$$

and that any a and b that meet these constraints lead to valid values for $\theta$ and 
$\phi$. For any of the four possible choices of sign for a and b, these constraints are 
linear in a and b. 
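
A small numerical check can make these constraints concrete. The sketch below is illustrative only: the function names are hypothetical, and the expressions used for a and b in terms of $\theta$ and $\phi$ (namely $a = \cos\phi + \sin\phi\cos\theta/\sin\theta$ and $b = \sin\phi/\sin\theta$) are an assumed reading of the abbreviation above. Under that reading $a = \sin(\theta+\phi)/\sin\theta$, so $|a|$, $|b|$ and 1 are proportional to the sines of three angles that behave like the sides of a triangle, which is where the inequalities come from.

```python
import math


def is_valid_ab(a, b, tol=1e-9):
    """Check -1 <= |a| - |b| <= 1 and 1 <= |a| + |b| (with a small tolerance)."""
    return (abs(abs(a) - abs(b)) <= 1.0 + tol
            and abs(a) + abs(b) >= 1.0 - tol)


def ab_from_angles(theta, phi):
    """a and b for given rotation angles (assumed sign convention)."""
    s = math.sin(theta)
    a = math.cos(phi) + math.sin(phi) * math.cos(theta) / s
    b = math.sin(phi) / s
    return a, b


# Values derived from rotations always pass the check ...
derived_ok = all(is_valid_ab(*ab_from_angles(theta, phi))
                 for theta in (0.3, 1.0, 2.0, 2.8)
                 for phi in (-1.2, 0.0, 0.7, 2.5))
# ... while arbitrary pairs need not.
too_small = is_valid_ab(0.2, 0.3)    # |a| + |b| < 1
too_uneven = is_valid_ab(3.0, 0.5)   # | |a| - |b| | > 1
```

Once the signs of a and b are fixed, the absolute values can be dropped, which is what makes the same conditions usable as linear constraints.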
Constraints from new silhouette. Now consider again a cross-section 
of the constraints, $C_i$, and of the filled silhouette, $R_i$ (see Figure 1, right). $C_i$ 
is a rectangle, with known vertices $d_{i,1}, d_{i,2}, d_{i,3}, d_{i,4}$. Specifically: 
$d_{i,1} = (P_{i,\min}, i, Q_{i,\min})$, $d_{i,2} = (P_{i,\max}, i, Q_{i,\min})$, 
$d_{i,3} = (P_{i,\min}, i, Q_{i,\max})$, $d_{i,4} = (P_{i,\max}, i, Q_{i,\max})$. 

Under projection, these vertices map to a horizontal line with y = i. We 
will consider constraints from just one y = i plane, and drop i to simplify 
notation. Call the x coordinates of these projected points $d'_1, d'_2, d'_3, d'_4$. That is, 
$d'_j = (a, 0, b) \cdot d_j + t$. Notice that the signs of a and b determine which of these 
points have extremal values. For example, $a, b > 0 \Rightarrow d'_1 \le d'_2, d'_3 \le d'_4$. We 
continue with this example; the other three cases can be treated similarly. 

$R_i$ is a line segment, with end points whose x values we’ll denote by $e_1, e_2$, 
with $e_1 \le e_2$. Since M is constrained to lie inside $C_i$ in the y = i plane, we 
know that $e_1$ and $e_2$ must lie in between the two extremal points. That is: 
$d'_1 \le e_1$, $e_2 \le d'_4$. Furthermore, we know that M touches every side of $C_i$. This 
means that the projection of each side must include at least one point that is in 
R. This will be true if and only if: $e_1 \le d'_2$, $e_1 \le d'_3$, $d'_2 \le e_2$, $d'_3 \le e_2$. 

These are necessary and sufficient constraints for r to be a possible silhouette 
of the shape that produced p and q. Finally, since $d'_j = (a, 0, b) \cdot d_j + t$ these 
constraints are linear in a, b, and t. As noted above, for a, b > 0 we also have 
linear constraints on a and b that express necessary and sufficient conditions for 
them to be derived from rotations. So we can check whether a new silhouette is 
consistent with two previous ones using linear programming. 

Because of noise, the constraints might become slightly infeasible. It is 
therefore useful to specify a linear objective function that allows us to check how 
close we can come to meeting the constraints. We can write the constraints as, 
for example, $(a, 0, b) \cdot d_{i,1} + t \le R_{i,\min} - \lambda$. Then we run a linear program to satisfy 
these while maximizing $\lambda$. The constraints we have derived are all met if these 
constraints are met with $\lambda \ge 0$. If $\lambda < 0$ then $\lambda$ provides a measure of the degree 
to which the constraints are violated. 
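
To make the role of $\lambda$ concrete, the following sketch evaluates it for one fixed choice of (a, b, t) in the case a, b > 0. It is not a linear-programming solver and is not the authors' implementation: the function name `margin` and the tuple layout of each slice are assumptions for illustration. For every cross-section it projects the four rectangle vertices, forms the six inequalities listed above, and returns the smallest slack, which is the largest $\lambda$ for which all constraints hold at that (a, b, t). A linear program would search over a, b and t to maximize this value.

```python
def margin(slices, a, b, t):
    """Largest lambda such that every constraint holds with slack lambda.

    Each slice is a tuple (p_min, p_max, q_min, q_max, e1, e2): the extents of
    the first two silhouettes and the end points of the third at one height.
    Only the case a > 0, b > 0 is handled.
    """
    if a <= 0 or b <= 0:
        raise ValueError("this sketch covers only a, b > 0")
    slacks = []
    for p_min, p_max, q_min, q_max, e1, e2 in slices:
        d1 = a * p_min + b * q_min + t   # projected rectangle vertices
        d2 = a * p_max + b * q_min + t
        d3 = a * p_min + b * q_max + t
        d4 = a * p_max + b * q_max + t
        slacks += [e1 - d1, d4 - e2,     # silhouette lies inside the projection
                   d2 - e1, d3 - e1,     # every side of the rectangle is touched
                   e2 - d2, e2 - d3]
    return min(slacks)


# Hypothetical example: a slice that fills its whole rectangle, seen with
# a = b = sqrt(1/2) and no translation.
a = b = 0.5 ** 0.5
t = 0.0
consistent = [(0.0, 2.0, 0.0, 1.0, 0.0, a * 2.0 + b * 1.0 + t)]
too_wide = [(0.0, 2.0, 0.0, 1.0, 0.0, 3.0)]
lam_ok = margin(consistent, a, b, t)
lam_bad = margin(too_wide, a, b, t)
```

In the example the consistent silhouette gives a margin of zero, the boundary case, while the widened silhouette gives a negative margin whose size reflects how far it sticks out of the projected rectangle.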

3.2 Silhouettes with Complex Cross-Sections 

Up to now, we have assumed that a horizontal cross-section of a silhouette con- 
sists of a single line segment. This will not generally be true for objects with 
multiple parts, holes, or even just concavities. These multi-line silhouettes com- 
plicate the relatively simple picture we have derived above. We wish to make 
several points about multi-line silhouettes. First, if we fill in all gaps between 
line segments we can derive the same straightforward constraints as above; these 
will be necessary, but not sufficient conditions for a new silhouette to match two 
previous ones. Second, if we merely require that the new silhouette have a num- 
ber of lines that is consistent with the first two silhouettes, this constraint can 
be applied efficiently, although we omit details of this process for lack of space. 
Third, to exactly determine whether a new silhouette is consistent with previous 
ones becomes computationally more demanding, requiring consideration of eit- 
her a huge number of possibilities, or an explicit search of the space of rotations. 
But if we make a simple genericity assumption, the complexity can be reduced 
to a small size again. 




Fig. 2. Here we show two silhouettes, P and Q, that have cross-sections that consist 
of two line segments. On the right we show how the i cross-section leads to four 
parallelograms that must contain the object. Either the dark grey pair ($o_{1,1}$, $o_{2,2}$) or the 
patterned pair ($o_{1,2}$, $o_{2,1}$) must contain parts of the object. A third viewing angle, 
labeled feasible angle, is shown for which this object may appear as a single line segment. 
An infeasible angle is also shown; from this viewpoint the object must produce two line 
segments in the image. Our system uses this constraint, though we omit details of how 
this is done. In some cases we make a continuity assumption across cross-sections. On 
the left, this means that, for example, if $P_{i,1}$ matches $Q_{i,1}$ ($o_{1,1}$ contains part of the 
object) then $P_{i+1,1}$ matches $Q_{i+1,1}$. 



In the example shown in Figure 2, we can suppose either that $o_{1,1}$ and $o_{2,2}$ are 
occupied, or that $o_{1,2}$ and $o_{2,1}$ are occupied (there are other possibilities, which 
we can handle, but that we omit here for lack of space). When we assume, for 
example, that $o_{1,1}$ contains part of the object we can say that $P_{i,1}$ is matched to 
$Q_{i,1}$. Each of the two possible ways of matching ($P_{i,1}$, $P_{i,2}$) to ($Q_{i,1}$, $Q_{i,2}$) must be 
separately pursued, and gives rise to separate constraints that are more precise 
than the coarse ones we get by filling in the gaps in multi-line cross-sections. 
Suppose that for k consecutive cross-sections, the first two silhouettes each have 
two line segments. If we consider all possible combinations of correspondences 
across these cross-sections, we would have $2^k$ possibilities, a prohibitive number. 
But we can avoid this with a simple genericity assumption. We assume that in 3-D, 
it does not happen that one part of an object ends exactly at the same height 
that another part begins. This means that given a correspondence between line 
segments at one cross-section, we can typically infer the correspondence at the 
next cross-section. 

3.3 Occlusion 

The methods described above can also be applied to partially occluded silhouet- 
tes. To do this, something must be known about the location of the occlusion. For 

Fig. 3. Seven objects used in experiments. 



example, if a cross-section is known to be occluded in one silhouette, that cross- 
section can be discarded. If a cross-section is known to be partially occluded in 
the third silhouette, the visible portion can be required to lie inside the projec- 
tion of the constraining parallelogram derived from the other two. Occlusion may 
not only make it impossible to derive constraints from occluded cross-sections, 
it may also create uncertainty in determining which cross-sections correspond to 
each other. For example, if the bottom of an object is blocked in a third view, 
we will not know how many cross-sections are occluded. We can solve this by 
searching through different scalings of the silhouette, which imply different pos- 
sible ways of matching its cross-section to the first two silhouettes. We can then 
select the scale or scales that allow the resulting constraints to be met. 

3.4 Experiments 

We test these ideas using the objects shown in Figure 3. Our experimental system 
varies in which approach we use to handle multi-lines. 

Experiment 1: First, we experiment with coarse constraints that fill in 
any gaps present in a silhouette cross-section. Also, we heuristically throw away 
some constraints that may be sensitive to small misalignments between different 
silhouettes. In this experiment we use five silhouettes taken from a figure of Snow 
White (Figure 4), photographed after rotations of 20°. First, all ten triplets 
of these silhouettes are compared to each other. In all cases they are judged 
consistent ($\lambda \ge 0$). Next, we compared each pair to 95 silhouettes, taken from 
the objects shown in Figure 3. About 6% of these other objects are also judged 
consistent with two Snow Whites (see Figure 4). 



Fig. 4. Silhouettes of Snow White, numbered one to five from left to right. On the right, 
experimental results (horizontal axis: Snow White pairs). 

Fig. 5. On the left, silhouettes of Hermes. On the right, experiments comparing pairs 
of these to either a third silhouette of Hermes (first ten data points, shown as circles) or 
to silhouettes of other objects (shown as crosses). A horizontal line at $\lambda = -5$ separates 
correct answers with one false positive. 



Experiment 2: Next, we performed a similar experiment using the silhou- 
ettes of Hermes in Figure 0 The axis of rotation was tilted slightly, so that the 
images do not exactly lie on a great circle on the viewing sphere. We heuristically 
compensate for this by searching for good in-plane rotations. For all 10 triples of 
silhouettes of Hermes, we obtain values of A ranging from -4.4 to 1.7. However, 
when we compare randomly chosen pairs of Hermes silhouettes to randomly cho- 
sen silhouettes of the other five objects, we obtain only one case in twenty-five 
with A larger than -4.4; other values are much smaller (see Figure 5).






Fig. 6. The two figures on the left show the first and second silhouettes of the object
used in experiment 4. The third silhouette shows this object from a new view. The
fourth silhouette shows the same view with the object scaled so that its cross-sections
are 1.8 times as big. This silhouette cannot be matched to the first two without greatly
violating the constraints (A < -12).



Experiment 3: We now show an experiment in which we search through possible
correspondences between different object parts. We use a synthetic shape,
vaguely like a human torso (Figure 6). Given three silhouettes, there are four
possible correspondences between the “hands”. We consider all four, then use the
continuity constraint to determine correspondences at subsequent cross-sections.
We compare two silhouettes to a third in which the shape has been “fattened”,
so that the cross-section of each part is scaled by a constant factor. When scale is
1, therefore, the third silhouette comes from the same object that produced the
first two. In Figure 7 we show how A varies with scale. We also show what happens
if we do not hypothesize correspondences between the parts of the figure,
but just fill in gaps in multiline cross-sections to derive a simple, conservative
set of constraints. For an object like this, with small parts widely separated, this
approach is much too conservative.

Fig. 7. On the left, we show how A varies as the third silhouette scales. A scale of 1
means the third silhouette comes from the same object as the first two. On the right,
we show this when individual parts are not matched, and only coarser constraints are
derived by filling in all gaps in each cross-section of each silhouette. Note that the
horizontal scale is approximately ten times larger on the right.








Fig. 8. On the left, the third Snow White silhouette, half occluded. The next two
images show one hypothesized occlusion of the first two silhouettes that matches this;
the last two images show a second (false) match.



Experiment 4: Finally, we experiment with a case in which the first two 
Snow White silhouettes are matched to the third, but the bottom half of the 
third is occluded. In this case we try matching this half-silhouette to some top 
portion of the other two silhouettes, considering all possible top portions. We 
find two regions in the set of possible scales in which the third silhouette matches
portions of the first two within one pixel of error: either when the first two are
assumed to be about half occluded (the correct choice) or when they are assumed
to be about 70% occluded (incorrect). Both are shown in Figure 8.

4 Conclusions 

We have analyzed the problem of object recognition using silhouettes. We es- 
pecially focus on the problem in which our knowledge of an object comes from 
seeing it from only a few viewpoints, under relatively unstructured viewing
conditions, and in which we do not have a priori knowledge that restricts the model
to belong to a special class of objects. This situation has not been much
addressed, presumably because in this case it is not possible to derive a definite
3D model. However, we show that even though we may have considerable un- 
certainty about the 3D object’s shape, there is still a lot of information that we 
can use to recognize the object from new viewpoints. This fits our general view 
that recognition can be done by comparing images, if our comparison method is 
based upon the knowledge that images are the 2D projections of the 3D world. 

Our analysis has been restricted to the case where the objects or camera 
rotate about an axis that is parallel to the image plane. This is a significant 
restriction, but we feel that this case is worth analyzing for several reasons. 
First, it occurs in practical situations such as when a mobile robot navigates in 
the world, or in images generated to study human vision. Second, this analysis 
gives us insight into the more general problem, and provides a starting point for 
its analysis.

Abstract. Energy-minimizing techniques are an interesting approach
to the segmentation problem. They extract image components by deforming
a geometric model according to energy constraints. This paper
proposes an extension of these techniques that can segment arbitrarily
complex image components in any dimension. The geometric model is a 
digital surface with which an energy is associated. The model grows in- 
side the component to segment by following minimal energy paths. The 
segmentation result is obtained a posteriori by examining the energies 
of the successive model shapes. We validate our approach on several 2D 
images. 



1 Introduction 

A considerable amount of literature is devoted to the problem of image
segmentation (especially 2D image segmentation). Image components are determined
either by examining image contours or by looking at homogeneous regions (and
sometimes by using both). The segmentation problem cannot generally
be tackled without adding to that information some a priori knowledge about image
components, e.g., geometric models, smoothness constraints, reference shapes,
training sets, user interaction. This paper deals with the segmentation problem
for arbitrary dimensional images. We are interested in methods extracting an 
image component by deforming a geometric model. The following paragraphs 
present classical techniques addressing this issue. 

Energy-minimizing techniques [?] have proven to be a powerful tool in this
context. They are based on an iterative adaptation process, which locally deforms
a parametric model. The fit between model and image is expressed as an energy,
which is minimal when the model geometry corresponds to image contours. The 
continuity of the geometric model and tunable smoothness constraints provide a 
robust way to extract image components, even in noisy images. The adaptation 
process is sensitive to initialization since it makes the model converge on local 
minima within the image. The parametric definition of the model also restricts 
its topology to simple objects. Recent works now propose automated topology 
adaptation techniques to overcome this issue, both in 2D HD] and in 3D p. 
However, these techniques are difficult to extend to arbitrary dimensions.

Front propagation techniques have been proposed to avoid the topology
restriction induced by the model parameterization. Instead of deforming a
geometric model in the image, they assign a scalar value to each point of the image
space. The evolution of the point values is governed by partial differential
equations, similar to the heat diffusion equation. The model is then implicitly defined
as a level set of this space, which is called a front. The equations are designed
to make the front slow down on strong contours and to minimize its perimeter
(or area in 3D) [2]. The implicit definition of the front ensures natural topology
changes. However, this technique is not designed to integrate a priori knowledge
about the image component (e.g., other geometric criteria, reference shape).

In region growing methods the extraction of an image component follows 
two steps: (i) seeds are put within the component of interest and (ii) these seeds 
grow by iteratively adding pixels to them according to a merging predicate (ho- 
mogeneity, simple geometric criterion). These methods are interesting because 
on one hand they have a simple dimension independent formulation and on the 
other hand they can segment objects of arbitrary topology. However, they are 
not well adapted to the extraction of inhomogeneous components. 

This paper proposes an original approach based on a discrete geometric model 
that follows an energy-minimizing process. The discrete geometric model is the
digital boundary of an object growing within the image. The model energy is 
distributed over all the boundary elements (i.e., the surfels). The energy of 
each element depends on both the local shape of the boundary and the sur- 
rounding image values. The number of possible shapes within an image grows 
exponentially with its size. Therefore, the following heuristic is used to extract 
components in an acceptable time. The model is initialized as an object inside 
the component of interest. At each iteration, a set of connected elements (i.e., 
a voxel patch) is locally glued to the model shape. The size and position of this 
set are chosen so that the object boundary energy is minimized. This expansion
strategy, associated with proper energy definitions, induces the following model
behavior: strong image contours form “wells” in the energy that hold the model;
growth is penalized in directions that increase the area and local curvature of
the object boundary; and the model grows without constraints elsewhere.

This model casts the energy-minimizing principle in a discrete framework.
Significant advantages are thus obtained: reduced sensitivity to initialization,
modeling of arbitrary objects, arbitrary image dimension. The paper is organized
as follows. Section 2 recalls some necessary digital topology definitions
and properties. Section 3 defines the model geometry and its energy. Section 4 
presents the segmentation algorithm. Segmentation results on 2D images are 
presented and discussed in Section 5. 

2 Preliminary Definitions 

A voxel is an element of the discrete n-dimensional space $\mathbb{Z}^n$, for $n \geq 2$.
Some authors [?] use the term “spel” for a voxel in an n-dimensional space; since we
feel that no confusion should arise, we keep the term “voxel” for any dimension.
Let $M$ be a finite “digital parallelepiped” in $\mathbb{Z}^n$. An image $I$ on
$\mathbb{Z}^n$ is a tuple $(M, f)$ where $f$ is a mapping from the subset $M$ of
$\mathbb{Z}^n$, called the domain of $I$, toward a set of numbers, called the range
of $I$. The value of a voxel $u \in M$ in the image $I$ is the number $f(u)$. An object
is any nonempty subset of the domain $M$. The complement of the object $O$ in $M$ is
denoted by $O^c$.

Fig. 1. Local computation of surfel adjacencies in the 2D case. The 4-adjacency (resp.
8-adjacency) has been chosen for the object elements (resp. background elements).
Legend: voxel, voxel in $O$, surfel of $\partial O$, surfel adjacency.

Let $\omega_n$ be the adjacency relation on $\mathbb{Z}^n$ such that $\omega_n(u,v)$ is
true when $u$ and $v$ differ by $\pm 1$ on exactly one coordinate. Let $\alpha_n$ be the
adjacency relation such that $\alpha_n(u,v)$ is true when $u \neq v$, and $u$ and $v$
may differ by either $-1$, $0$, or $1$ on any one of their coordinates. If $\rho$ is any
adjacency relation, a $\rho$-path from a voxel $v$ to a voxel $w$ on a voxel set $A$ is
a sequence $u_0 = v, \ldots, u_m = w$ of voxels of $A$ such that, for any $0 \leq i < m$,
$u_i$ is $\rho$-adjacent to $u_{i+1}$. Its length is $m + 1$.

For any voxels $u$ and $v$ with $\omega_n(u,v)$, we call the ordered pair $(u,v)$ a surfel
(for “surface element” [?]). Any nonempty set of surfels is called a digital surface.
For any given digital surface $\Sigma$, the set of voxels $\{v \mid (u,v) \in \Sigma\}$ is
called the immediate exterior of $\Sigma$ and is denoted by $IE(\Sigma)$. The boundary
$\partial O$ of an object $O$ is defined as the set
$\{(u,v) \mid \omega_n(u,v) \text{ and } u \in O \text{ and } v \in O^c\}$.

Up to now, an object boundary is just viewed as a set. It is convenient to have 
a notion of surfel neighbors (i.e., a “topology”) in order to define connected zones 
on an object boundary or to determine an object by tracking its boundary. In 
our case, this notion is compulsory to define a coherent model evolution. Besides, 
defining an object through its boundary is often faster. 

Defining an adjacency between surfels is not as straightforward as defining
an adjacency between voxels (especially for $n \geq 3$). The problem lies in the fact
that object boundary components (through surfel adjacencies) must separate
object components from background components (through voxel adjacencies).
In this paper, we do not focus on building surfel adjacencies consistent with a
given voxel adjacency. Consequently, given an object $O$ considered with a voxel
adjacency $\rho$, with either $\rho = \omega_n$ or $\rho = \alpha_n$, we will admit that
it is possible to locally define a consistent surfel adjacency relation, denoted by
$\beta_O$, for the elements of $\partial O$ (3D case, see [?]; nD case, Theorem 34 of
Ref. [?]). For the 2D case, Figure 1 shows how to locally define a surfel adjacency on
a boundary.

The $\beta_O$-adjacency relation induces $\beta_O$-components on $\partial O$.
$\beta_O$-paths on $\partial O$ can be defined analogously to $\rho$-paths on $O$. The
length of a $\beta_O$-path is similarly defined. The $\beta_O$-distance between two
surfels of $\partial O$ is defined as the length of the shortest $\beta_O$-path between
these two surfels. The $\beta_O$-ball of size $r$ around a surfel $\sigma$ is the set of
surfels of $\partial O$ which are at a $\beta_O$-distance less than or equal to $r$ from
the surfel $\sigma$. Let $\Sigma$ be a subset of $\partial O$. We define the border
$B(\Sigma)$ of $\Sigma$ on $\partial O$ as the set of surfels of $\Sigma$ that have at
least one $\beta_O$-neighbor in $\partial O \setminus \Sigma$. The k-border $B_k(\Sigma)$
of $\Sigma$ on $\partial O$, $1 \leq k$, is the set of surfels of $\Sigma$ which have a
$\beta_O$-distance less than $k$ from a surfel of $B(\Sigma)$.
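
The ball and border definitions above amount to a breadth-first traversal of the
surfel adjacency graph. The sketch below is only illustrative: it assumes the
adjacency is supplied as a callable `neighbors(s)` (how $\beta_O$ is built is not
shown here), and it follows the length convention stated above, so a surfel is at
distance 1 from itself. The function names are hypothetical.

```python
from collections import deque


def ball(center, size, neighbors):
    """Surfels whose distance to `center` is at most `size`.

    Distance is the number of surfels on a shortest path (a path with m steps
    has length m + 1), so `center` is at distance 1 from itself.
    """
    if size < 1:
        return set()
    dist = {center: 1}
    queue = deque([center])
    while queue:
        s = queue.popleft()
        if dist[s] == size:
            continue  # neighbours of s would lie outside the ball
        for t in neighbors(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return set(dist)


def border(subset, neighbors):
    """Surfels of `subset` having at least one neighbour outside `subset`."""
    return {s for s in subset if any(t not in subset for t in neighbors(s))}
```

For a closed 2D contour, `neighbors` simply returns the previous and next pixel
edges along the contour.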

3 The Discrete Deformable Boundaries Model

For the purpose of image segmentation, we introduce a geometric model to fit 
image components. The model geometry (i.e., its shape) is defined as an object in 
the image. It is equivalently defined as the boundary of this object. Note that the 
model is not necessarily connected. The model geometry is meant to evolve from
an initial shape toward an image component boundary. This evolution depends
on an energy associated with each possible geometry of the model.

We identify the energy of an object with the energy of its boundary. The energy
of an object boundary $\partial O$, denoted by $E(\partial O)$, is computed by summation
of the energies of each of its surfels. To get finer estimates, the energy of a surfel
may depend on a small neighborhood around it on the object boundary. That
is why we use the notation $E(\sigma)_{\partial O}$ to designate the energy of the surfel
$\sigma$ with respect to $\partial O$ ($\sigma$ must be an element of $\partial O$). By
definition, we set $E(\Sigma)_{\partial O} = \sum_{\sigma \in \Sigma} E(\sigma)_{\partial O}$,
where $\Sigma$ is any nonempty subset of an object boundary $\partial O$.

The surfel energy is the sum of several energy terms. Two types of surfel ener- 
gies are distinguished: the surfel energies that only depend on the local geometry 
of the boundary around the surfel are called internal energies, the surfel energies 
that also depend on external parameters (e.g., local image values) are called
external energies. The local geometric characteristics required for the surfel energy
computation are based upon a neighborhood of the surfel on the object boundary:
it is a $\beta_O$-ball on the object boundary, centered on the surfel. To simplify
notations, we assume that this ball has the same size $p$ for the computation of
every surfel energy. The whole surfel energy is thus locally computed.

Internal energies allow a finer control on the model shape (e.g., smoothness).
External energies express the agreement between image and model or other external
constraints (e.g., similarity to a reference shape). The following paragraphs present
a set of internal and external energies pertinent to our segmentation purpose.
This set is by no means restrictive. New energies can be specifically designed for a
particular application provided that the summation property is satisfied.

Image features are generally not sufficient to clearly define objects: the
boundary can be poorly contrasted or even incomplete. To tackle this problem,
we use the fact that the shape which most likely matches an image component
is generally “smooth”. In our case, this is expressed by defining two internal
energies. The stretching energy $E^s(\sigma)_{\partial O}$ of a surfel $\sigma$ is defined
as an increasing function of the area of the surfel $\sigma$. The bending energy
$E^b(\sigma)_{\partial O}$ of a surfel $\sigma$ is defined as an increasing function of the
mean curvature of the surfel $\sigma$. Examples of area and mean curvature computations
are given in Section 5.1. Note that these energies correspond to the internal energies
of many deformable models [?], which regularize the segmentation problem.

In our context, we define a unique external energy, based on the image value
information. Since the model evolution is guided by energy minimization, the
image energy $E^I(\sigma)_{\partial O}$ of a surfel $\sigma$ should be very low when
$\sigma$ is located on a strong contour. A simple way to define the image energy at a
surfel $\sigma = (u, v)$ on an object boundary is to use the image gradient:
$E^I(\sigma)_{\partial O} = -\|f(v) - f(u)\|^2$, if $I = (M, f)$. This definition is valid
for arbitrary $n$.
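
A minimal sketch of this image energy, under the assumption that the image is
available as a callable `f` mapping a voxel to a number or to a tuple of numbers
(the name `image_energy` is hypothetical):

```python
def image_energy(surfel, f):
    """E^I(sigma) = -||f(v) - f(u)||^2 for the surfel sigma = (u, v)."""
    u, v = surfel
    fu, fv = f(u), f(v)
    # Accept scalar values as well as vector values (e.g., colour channels).
    if not isinstance(fu, (tuple, list)):
        fu, fv = (fu,), (fv,)
    return -sum((b - a) ** 2 for a, b in zip(fu, fv))
```

The value is zero across a flat region and becomes more negative as the contrast
across the surfel increases, independently of the dimension of the voxels.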

The model grows by minimizing its energy at each step. Since the growth
is incremental, an incremental computation of the energy is appropriate.
More precisely, the problem is to compute the energy of an object $O'$ given the
energy of an object $O$ included in $O'$. In our case, the digital surfaces $\partial O$
and $\partial O'$ generally have many more common surfels than uncommon surfels (this
becomes more and more true as the model grows). The set of the common
surfels is denoted by $\Phi$. The $p$-border of $\Phi$ on $\partial O$ is identical to the
$p$-border of $\Phi$ on $\partial O'$. We can thus denote it uniquely by $B_p(\Phi)$. The
energies of $\partial O$ and $\partial O'$ are expressed by the two following equations:

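\[
E(\partial O) = \sum_{\sigma \in \partial O} E(\sigma)_{\partial O}
  = E(\Phi \setminus B_p(\Phi))_{\partial O} + E(B_p(\Phi))_{\partial O}
  + E(\partial O \setminus \Phi)_{\partial O},
\]
\[
E(\partial O') = \sum_{\sigma \in \partial O'} E(\sigma)_{\partial O'}
  = E(\Phi \setminus B_p(\Phi))_{\partial O'} + E(B_p(\Phi))_{\partial O'}
  + E(\partial O' \setminus \Phi)_{\partial O'}.
\]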

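From the surfel energy computation, it is easy to see that the surfels of
$\Phi \setminus B_p(\Phi)$, common to both $\partial O$ and $\partial O'$, hold the same
energy on $\partial O$ and on $\partial O'$. However, the energy of the surfels of
$B_p(\Phi)$ may (slightly) differ depending on whether they are considered on
$\partial O$ or on $\partial O'$. We deduce

\[
\underbrace{E(\partial O') - E(\partial O)}_{\text{variation}}
  = \underbrace{E(\partial O' \setminus \Phi)_{\partial O'}}_{\text{created surfels}}
  - \underbrace{E(\partial O \setminus \Phi)_{\partial O}}_{\text{deleted surfels}}
  + \underbrace{E(B_p(\Phi))_{\partial O'} - E(B_p(\Phi))_{\partial O}}_{\text{surfels close to displacement}}.
\]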

To get efficient energy computations at each step, each surfel of the model
stores its energy. When a model grows from a shape $O$ to a shape $O'$, the energy
of only a limited number of surfels has to be computed: (i) the energy of
the created surfels and (ii) the energy of the surfels near those surfels.
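
The bookkeeping described above can be sketched as follows. This is an illustrative
Python fragment, not the authors' implementation: `surfel_energy` is a hypothetical
callable returning the energy of a surfel on a given boundary, and `near` is assumed
to be the set of common surfels whose neighbourhood changed (the $p$-border of the
common surfels).

```python
def update_energy(total, cache, new_boundary, surfel_energy, near):
    """Update the boundary energy after the model grows.

    total          -- energy of the old boundary, i.e. sum(cache.values())
    cache          -- dict surfel -> stored energy on the old boundary (updated in place)
    new_boundary   -- set of surfels of the new boundary
    surfel_energy  -- hypothetical callable (surfel, boundary) -> energy
    near           -- common surfels whose neighbourhood changed
    """
    old_boundary = set(cache)
    for s in old_boundary - new_boundary:      # deleted surfels
        total -= cache.pop(s)
    for s in new_boundary - old_boundary:      # created surfels
        cache[s] = surfel_energy(s, new_boundary)
        total += cache[s]
    for s in near:                             # surfels close to the displacement
        fresh = surfel_energy(s, new_boundary)
        total += fresh - cache[s]
        cache[s] = fresh
    return total
```

Common surfels outside `near` are never touched, which is what keeps the cost of a
growth step proportional to the size of the patch rather than to the whole boundary.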



4 Segmentation Algorithm 

In the energy-minimizing framework, the segmentation problem is translated into 
the minimization of a cost function in the space of all possible shapes. Finding 
the minimum of this function cannot be done directly in a reasonable time. 
Except for very specific problems (e.g., what is the best contour between two 
known endpoints), heuristics are proposed to extract “acceptable” solutions. For 
the snake, an “acceptable” solution is a local minimum. We propose a heuristic 
that builds a set of successive shapes likely to correspond to image components. 

We first briefly outline the segmentation process. The model is initialized as
a set of voxels located inside the object to be segmented. At each step, a voxel
patch is added to the model. A voxel patch of radius $k$ around a surfel $\sigma$ on
the boundary $\partial O$ is the immediate exterior of the $\beta_O$-ball of size $k$
around $\sigma$. To decide where a voxel patch is “glued” to $O$, its possible locations
are enumerated. Among the possible resulting shapes, the one with the smallest
energy is chosen. Unlike most segmentation algorithms, this process does not
converge on the expected image component. However, the state of the model at
one step of its evolution is likely to correspond to the expected image component.
The boundary of the object of interest is hence determined a posteriori. This
technique is similar to the discrete bubble principle [?].

It is then possible to let the user choose the shape pertinent to the problem at hand
among the successive states of the model. A more automated approach can also
be taken. Since the model growth follows a minimal energy path (among all
possible shapes within the image), the agreement between image and model can be
estimated through the model energy at each step. Consequently, the shape of minimal
energy often delineates an object which is a pertinent component of the image.

For now, $k$ is a given strictly positive integer. It corresponds to the
radius of the voxel patch that is added to the model at each step. The following
process governs the growth of the model and assigns an energy to
each successive model state (see Fig. 2):

1. Assign 0 to $i$. Let $O_0$ be equal to the initial shape. The initial shape is a
subset of the image domain included in the component(s) to extract. Let $E_0$
be equal to $E(\partial O_0)$.

2. For all surfels $\sigma \in \partial O_i$, perform the following steps:

a) Extract the $\beta_{O_i}$-ball $V_\sigma$ of radius $k$ around $\sigma$. Define
$O_\sigma$ as $O_i \cup IE(V_\sigma)$.

b) Incrementally compute $E(\partial O_\sigma)$ from $E(\partial O_i)$.

3. Select a surfel $\tau$ with minimal energy $E(\partial O_\tau)$.

4. Let $O_{i+1}$ be equal to $O_\tau$. Let $E_{i+1}$ be equal to $E(\partial O_\tau)$.
Increment $i$.

5. Go back to step 2 until an end condition is reached (e.g., $O = M$, user
interaction, automated minimum detection).

In the experiments described in Section 5, the end condition is O = M, which 
corresponds to a complete model expansion. 
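
The loop above can be sketched in 2D as follows. This is a deliberately simplified
Python illustration, not the authors' prototype. Its assumptions: each candidate
patch is the single exterior pixel of a surfel, the energy is recomputed from scratch
instead of incrementally, the stretching term is a constant 1 per surfel, there is no
bending term, and the image term is rescaled to [-0.5, 0.5] by an assumed
normalization. All function names and default weights are hypothetical.

```python
def boundary(obj, rows, cols):
    """Surfels (u, v): u in obj, v a 4-neighbour of u inside the image but outside obj."""
    out = []
    for (r, c) in obj:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            v = (r + dr, c + dc)
            if 0 <= v[0] < rows and 0 <= v[1] < cols and v not in obj:
                out.append(((r, c), v))
    return out


def energy(obj, image, a_s, a_i, max_sq):
    """Sum of surfel energies: constant stretching term plus normalised image term."""
    total = 0.0
    for (u, v) in boundary(obj, len(image), len(image[0])):
        diff = image[v[0]][v[1]] - image[u[0]][u[1]]
        e_img = 0.5 - diff * diff / max_sq   # in [-0.5, 0.5], lowest on strong contours
        total += a_s * 1.0 + a_i * e_img
    return total


def grow(image, seed, a_s=0.25, a_i=0.75):
    """Grow from `seed` until the object covers the image; return shapes and energies."""
    rows, cols = len(image), len(image[0])
    values = [x for row in image for x in row]
    max_sq = max((max(values) - min(values)) ** 2, 1e-12)
    obj = set(seed)
    shapes = [frozenset(obj)]
    energies = [energy(obj, image, a_s, a_i, max_sq)]
    while len(obj) < rows * cols:
        # Step 2: try every candidate location (here a one-pixel patch per surfel).
        candidates = sorted({v for (_, v) in boundary(obj, rows, cols)})
        # Step 3: keep the candidate giving the boundary of minimal energy.
        best = min(candidates, key=lambda v: energy(obj | {v}, image, a_s, a_i, max_sq))
        # Step 4: record the new shape and its energy.
        obj.add(best)
        shapes.append(frozenset(obj))
        energies.append(energy(obj, image, a_s, a_i, max_sq))
    return shapes, energies
```

The segmentation is then read a posteriori, for instance as
`shapes[energies.index(min(energies))]`, which mirrors the automated choice of the
shape of minimal energy discussed above.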




Fig. 2. A step in the model growing. (a) $\sigma$ is a surfel of the model boundary at
step $i$. (b) The voxel patch of radius 4 around $\sigma$ is added to the object. (c) Model
shape at step $i + 1$ if $\sigma$ is chosen as the best place to make the model grow.
Legend: surfel of $\partial O_i$ and its $\beta_{O_i}$-ball, added surfels, deleted surfels,
surfels close to displacement.



5 2D Experiments 

In order to validate this segmentation approach, a 2D prototype has been
implemented. The experiments emphasize the ability of our model to segment image
components in various contexts: incomplete or weakly contrasted contours,
inhomogeneous components. In 2D, voxels correspond to pixels and surfels are usually
called pixel edges. We first consider how energies are computed and weighted.
We then highlight the model capabilities on both synthetic and medical images.

5.1 Energy Computation 

For our experiments, the energies for a surfel $\sigma = (u, v)$ are computed as follows:

- The stretching energy $E^s(\sigma)_{\partial O}$ is the contribution of the surfel
$\sigma$ to the model perimeter. The boundary $\partial O$ can be viewed as one or
several 4-connected paths. We adapt the Rosen-Proffitt estimator to measure the
contribution of $\sigma$ to the boundary length. On the boundary path, $\sigma$ has two
endpoints $p_1$ and $p_2$. These points can either be “corner” points or “non-corner”
points. We set $E^s(\sigma)_{\partial O} = (\psi(p_1) + \psi(p_2))/2$, where
$\psi(p) = \psi_c = 0.670$ if $p$ is a corner point and $\psi(p) = \psi_{nc} = 0.948$
otherwise. Note that other perimeter estimators could be used (e.g., see [?]).

- The bending energy $E^b(\sigma)_{\partial O}$ is computed as the ratio of the two
distances $l$ and $D$, defined as follows. The $\beta_O$-ball of size $p$ around $\sigma$
forms a subpath of $\partial O$, which has two endpoints $e_1$ and $e_2$. Its length $l$
is computed by the same Rosen-Proffitt estimator as above. The distance $D$ is the
Euclidean distance between $e_1$ and $e_2$. Many curvature estimation methods could be
implemented (angular measurement, planar deviation, tangent plane variation, etc. [12]).

- The image energy $E^I(\sigma)_{\partial O}$ is defined as in Section 3 as
$-\|f(v) - f(u)\|^2$. It is the only external energy used in the presented experiments.

The energy of a surfel $\sigma$ on the boundary $\partial O$ is the weighted summation of
the above-defined energies:

\[
E(\sigma)_{\partial O} = \alpha_s E^s(\sigma)_{\partial O} + \alpha_b E^b(\sigma)_{\partial O}
  + \alpha_I E^I(\sigma)_{\partial O},
\]

where $\alpha_s$, $\alpha_b$ and $\alpha_I$ are positive real numbers whose sum is one. To
handle comparable terms, internal energies are normalized to [0, 1]. The image energy
is normalized to [-0.5, 0.5] to keep in balance two opposite behaviors: (i) should
the image energy be positive, the shape of minimal energy would be empty, (ii)
should the image energy be negative, long and sinuous shapes would be favored.
Note that most classical deformable models choose a negative image energy
function. These coefficients allow us to tune more precisely the model behavior
on various kinds of images. A set of coefficients pertinent to an image will be
well adapted to similar images.
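
The internal energies and their weighted combination can be sketched as follows. This
Python fragment is only an illustration under stated assumptions: a boundary subpath
is given as a list of lattice points joined by unit steps, a point counts as a
“corner” when the step direction changes there, and the two extremities of an open
subpath are treated as non-corner points. The normalization of each term is left to
the caller, and all function names are hypothetical.

```python
import math

PSI_CORNER = 0.670      # weight of a corner point (value quoted in the text)
PSI_NON_CORNER = 0.948  # weight of a non-corner point (value quoted in the text)


def point_weights(path):
    """Weight psi(p) of every point of a path of unit steps on the pixel grid.

    A point is a corner when the direction changes there. The two extremities
    of an open path are treated as non-corner points (an assumption).
    """
    weights = [PSI_NON_CORNER] * len(path)
    for i in range(1, len(path) - 1):
        d_in = (path[i][0] - path[i - 1][0], path[i][1] - path[i - 1][1])
        d_out = (path[i + 1][0] - path[i][0], path[i + 1][1] - path[i][1])
        if d_in != d_out:
            weights[i] = PSI_CORNER
    return weights


def stretching_energies(path):
    """Energy (psi(p1) + psi(p2)) / 2 of each pixel edge [path[i], path[i + 1]]."""
    w = point_weights(path)
    return [(w[i] + w[i + 1]) / 2 for i in range(len(path) - 1)]


def bending_energy(path):
    """Ratio l / D: estimated length of the subpath over its endpoint distance."""
    length = sum(stretching_energies(path))
    return length / math.dist(path[0], path[-1])


def surfel_energy(e_s, e_b, e_i, a_s, a_b, a_i):
    """Weighted sum of the three (already normalised) energy terms."""
    if abs(a_s + a_b + a_i - 1.0) > 1e-9:
        raise ValueError("the weights are expected to sum to one")
    return a_s * e_s + a_b * e_b + a_i * e_i
```

With these conventions a straight subpath yields the smallest ratio and the ratio
grows as the subpath bends; a subpath whose two endpoints coincide would need
separate handling, since the distance D is then zero.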

5.2 Results 

In all presented experiments, we consider the model with 4-connectedness.
The surfel adjacency $\beta_O$ is therefore defined as shown in Fig. 1. The parameter
$p$ is set to 4.

Since our model searches for contours in images, inhomogeneous components
can efficiently be segmented. We illustrate this ability on the test image of Fig. 3a.
This test image raises another segmentation issue: contours are weak or even
nonexistent at some locations, both between the disk and the ring and between the
disk and the background. Fig. 3b emphasizes three steps in the model evolution:
the initial shape, the iteration when the model lies on the disk-ring boundary
(a local minimum in the energy function), the iteration when the model lies on
the ring-background boundary (the minimum of the energy function). Fig. 3c
displays the energy function. At each iteration, several patch radius sizes (i.e., $k$)
have been tested for each surfel: $k$ was between 0 and 6. The model successfully
delineates the two boundaries during its evolution, although the first boundary
induces a less significant minimum in the energy function than the second one:
the first boundary is indeed not as well defined as the second one.




Fig. 3. Inhomogeneous components segmentation. (a) Test image: a disk filled with a
shading from gray value 98 (top) to gray value 189 (bottom), a ring encircling it filled
with a sharper shading from 0 (top) to 255 (bottom), a homogeneous background of
gray value 215. (b) Three significant steps in the model evolution: initial shape,
disk-ring boundary, ring-background boundary. (c) Energy curve for this evolution: the
two extracted boundaries correspond to local minima of the energy function.



The second experiment illustrates the robustness of the segmentation process
with respect to the initial shape. The test image is an MR image¹ of a human heart
at diastole (Fig. 4a). Our objective is to segment the right ventricle. This image
component is inhomogeneous in its lower part and presents weak contours on its
bottom side. The other contours are more distinct but are somewhat fuzzy. All
these defects can be seen in Fig. 4b-c, which show the image after edge
detection. The middle row (Fig. 4d-f) presents three evolutions, one per column,
with three different initial shapes. Each image depicts three or four different
boundaries corresponding to significant steps in the model evolution. The bottom
row (Fig. 4g-i) displays the corresponding energy curves. For this experiment,
only the patch radius size 3 is tested for each surfel (i.e., $k = 3$). Whatever
the initial shape, the model succeeds in delineating the right ventricle. The left
ventricle may also be delineated in a second stage (near the end of the expansion),
but it is more hazardous: the proposed initial shapes are indeed extremely bad
for a left ventricle segmentation.

¹ Acknowledgements to Pr. Ducassou and Pr. Barat, Service de Médecine Nucléaire,
Hôpital du Haut Lévêque, Bordeaux, France.




Fig. 4. Robustness of the segmentation to initialization. (a) Test image: an MR image
of a human heart at diastole. (b) Image after Sobel filtering. (c) Image after Laplace
edge detection. The bottom two rows depict the model behavior for three different
initial shapes (energy parameters are set to $\alpha_s = 0.25$, $\alpha_b = 0.25$,
$\alpha_I = 1$). The middle row (d-f) shows significant steps in the model evolution
(initialization, important local minima). The corresponding energy curves are drawn in
the bottom row figures (g-i).



Our prototype does not include all the optimizations which could be implemented
for the 2D case (e.g., efficient traversal of surfel adjacency graphs, various
precomputations). For the heart image, whose size is 68 × 63, the complete model
evolution takes 349 s. The right ventricle is detected after 40 s.

6 Conclusion 

We have presented a discrete deformable model for segmenting image components.
The segmenting process is carried out by expanding a digital surface
within the image under internal and external constraints. The external constraint
(i.e., the image energy) stops the model expansion on strong contours.
At the same time, the internal constraints regularize the model shape. The robust
framework of energy-minimizing techniques is thus preserved. The significant
advantages of this model are its dimension independence and its ability
to naturally change topology. The first results on both synthetic and real-life
data are very promising. They underline the model's ability to process images
with poor contours and inhomogeneous components. Moreover, our segmentation
technique is less sensitive to initialization than classical energy-minimizing
techniques. Further information can be found in [?].

Abstract. This paper considers camera motion extraction with application
to automatic video classification. Video motion is subdivided into
three components, one of which, camera motion, is considered here. The
extraction of the camera motion is based on correlation. Both subjective
and objective measures of the performance of the camera motion extrac- 
tion are presented. This approach is shown to be simple but efficient and 
effective. This form is separated and extracted as a discriminant for video 
classification. In a simple classification experiment it is shown that sport 
and non-sport videos can be classified with an identification rate of 80%. 
The system is shown to be able to verify the genre of a short sequence 
(only 12 seconds), for sport and non-sport, with a false acceptance rate 
of 10% on arbitrarily chosen test sequences. 



1 Introduction 

The classification of videos is becoming ever more important as the amount of
multimedia material in circulation increases. Automatic classification of video
sequences would increase the usability of these masses of data by enabling people to
search multimedia databases quickly and efficiently. There are three main sources
of information in video. The first is the audio, which has been extensively researched
in the context of coding and recognition. The second is the individual images, which can
be classified by their content. The third is the dynamics of the image information
held in the time sequence of the video, and it is this last attribute that makes video
classification different from image classification. It is this dynamic aspect of video
classification, at the highest level, that we investigate here.

The most successful approaches to video classification are likely to use a
combination of static and dynamic information. However, in our approach we
use only simple motion measures to quantify the contribution of the dynamics
to high-level classification of videos [?]. Approaches to motion extraction range
from region-based methods such as those proposed by Bouthemy et al. [2] to tracking
objects as described by Parry et al. [3]. It is generally accepted that motion
present in videos can be classified into two kinds, foreground and background.
The foreground motion is the motion created by one or multiple moving objects
within the sequence. The background motion, usually the dominant motion in
the sequence, is created by movement of the camera.

Those approaches that attempt to classify the background motion usually do so
to label the video with the type of movement of the camera, as in, for
example, the work of Bouthemy and Xiong. Here we propose a method for
separating the foreground object and background camera motions. Haering
uses foreground and background motion in combination with some static features
to detect wildlife hunts in nature videos. We hypothesize that the separation 
of these two motion signals will benefit the classification of videos based on 
dynamics. Some preliminary results for the discriminating properties of camera 
motion alone in the context of a simple classification task of sport and non-sport 
are presented. The sports are chosen for their reasonably high pace during a
game, to make the illustrative example viable, e.g. soccer and rugby.



2 Video Dynamics 

Three forms of video dynamics can be identified. Two are motions within the
sequence, namely foreground and background; the third is of a different
origin and is manually inserted in the form of shot and scene changes. Here
the shot changes are automatically detected and discarded using pair-wise pixel
comparisons. There are more robust approaches to shot or cut detection, for
example Xiong. Porter et al. [7] use frequency domain correlation; however, this
is not currently integrated into the system and is the subject of further work.
Interestingly, Truong et al. [8] suggest that the shot length has some
discriminatory information, and this information is therefore used in their
genre classification of image sequences. We wish to separate all three of these
motion signals in order to assess the classification potential of each
individually. Here we consider the camera motion.

There can be some ambiguity in camera motion terminology, so here we 
clarify our interpretations of certain terms: pan is a rotation around a vertical 
axis, tilt is a rotation around a horizontal axis, Z or image rotation is a rotation
around the optical axis. X and Y translation is caused by the camera moving 
left-right and up-down respectively normal to the optical axis, zoom in/out is a 
focal length change, and Z translation is a linear movement of the camera along 
the optical axis. It is actually difficult to discriminate automatically between 
pan, tilt, and X, Y translations because they introduce only subtly different 
perspective distortions. These distortions occur around object boundaries, and 
are most prominent where the difference in depth is large. It is the nature of 
the occlusions that is used to differentiate between the motions. For zoom and Z 
translation the objects in the image are magnified differently according to their 
depth position.

We can distinguish two main classes of models for camera movement. The
first class aims at investigating the influence of the camera motion in the frames.
This requires a 2D affine motion model extracted using general optical flow
approaches. Some of these approaches use a 6-parameter camera model
describing the three rotations (pan, tilt, Z rotation) and three translations (X,
Y, Z) of the camera. Generally, the model is reduced to three parameters,
merging the terms pan and X translation, and tilt and Y translation, disregarding
the Z rotation, and replacing the Z translation by the notion of zoom.

The second class aims to characterize the relative camera location in 3D
space. These approaches claim to extract a 9-parameter camera model, for
example [12]. In addition to the 6-parameter model with the three translations
and three rotations, the X and Y focal lengths and the zoom operation are
included, as also presented by Sudhi et al. [13]. Some of these approaches use
constraints on the epipole. However, this second class of approaches, although
more accurate than the smaller model approaches, is complicated and
computationally expensive, and the parameters are unlikely to be accurately
determined from video analysis.

Here we use a simple three-parameter motion model to deal with camera
motion X, Y and Z. Left and right, or X motion includes X translation and pan. 
Up and down, or Y motion includes Y translation and tilt, and Z motion includes 
Z translation and focal length modifications or zoom. This provides relative 
computational efficiency and is predicted to include most of the discriminatory 
information available for high-level classification. 

3 The Correlation Approach 

In this section, we describe the correlation approach used here to extract camera 
motion. In Section 3.1 we see how to obtain motion vectors, and hence an optical
flow, for blocks within the image using correlation. The first step consists of
calculating a Pixel Translation and Deformation Grid (PTDG) between two 
consecutive frames using matching blocks. For each pixel in the image, the PTDG 
gives the translation and the deformation of its neighborhood. 

In Section 3.2 we show how to extract the camera motion from the optical
flow. A simple segmentation of this PTDG attempts to detect the different
moving objects and to quantify their translations in the image. The dominant
motion, over a certain threshold, is assumed to be caused by a mobile camera.
Although a large rigid object in the foreground can create a similar effect, this
is a rare occurrence and is not dealt with here. The translation of the background and
the zoom in/out of the camera is deduced from this PTDG along a 3-parameter 
camera model. The rotation about the optical axis is not considered as it too is 
rare and is mainly found in music videos, a class we do not currently consider. 

3.1 The NCS Principle 

From the current frame, we extract a block of n×n pixels, which serves as a
reference mask. The aim is to find the most similar block of pixels in the
previous frame by correlating the reference mask with all possible test masks
in the previous frame. The test mask with the highest correlation coefficient,
i.e. the most similar block, is denoted the matching mask. It has been observed
that the position of this matching mask is generally relatively close to the
original position, therefore only a subregion of the image is searched; this is
called the search neighborhood. The computation of the similarity indices
within a search neighborhood N×N is performed by comparing the reference mask
with the different test masks. We obtain this comparison measure by using the
normalized correlation coefficient C_{k,l}, for each pixel (k,l) in the search
neighborhood, given by Equation (1), where (k,l) locates the N×N search
neighborhood and defines the center of the test masks, (i,j) describes the n×n
mask, and c ∈ {R,G,B} is the red, green or blue byte. P_c is the color value of
the pixel in the current image, and Q_c is the color value of the pixel in the
previous frame.

Fig. 1. Search for the matching mask from three different reference masks: (a) good
correlation, (b) uniform area, (c) deformed area. Figure labels: reference mask,
test masks, search neighborhood, motion vector.

In Figure 1 there are three example cases, (a), (b) and (c), illustrating different
situations. The first, (a), is a good correlation with a motion vector shown by the
arrow. The second, (b), is a uniform area. This gives many good correlations with
resultant ambiguity. The third case, (c), shows an example of a deformed area
for which there are no good correlations. A 2D surface, called the Neighborhood
Correlation Surface (NCS), is obtained by plotting the correlation coefficients
C_{k,l} calculated over the search neighborhood. A value tending to 1 expresses an
exact fitting between the reference mask from the current frame and the test
mask from the previous frame.

In Figure 1 the reference mask (a) correlates almost exactly with a test mask
in the search neighborhood. The reliability of this normalized correlation
results from the fact that the coefficient decreases when the similarity
decreases. The maximum of the NCS is usually represented by a peak. This gives
us the previous position of the reference mask and therefore the evolution of
this mask through time (from the matching mask to the reference mask). Figure 2
illustrates a typical successful correlation: the NCS confidently shows the
position of the matching mask. This is the case of the reference mask (a) in
Figure 1.

Fig. 2. NCS for the reference mask (a): good similarity index
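As an illustration only, the block search described above can be sketched as follows. The exact form of Equation (1) is not reproduced here, so the sketch assumes a standard normalized cross-correlation without mean subtraction, summed over the three color channels; the function names, the frame representation (lists of rows of (R, G, B) tuples) and the `radius` parameter are hypothetical.

```python
import math


def ncs(current, previous, top, left, n, radius):
    """Neighborhood Correlation Surface for one n x n reference mask.

    current, previous: frames as lists of rows of (R, G, B) tuples.
    (top, left): corner of the reference mask in the current frame.
    radius: the search neighborhood covers offsets -radius..+radius.
    Returns {(dk, dl): coefficient}; test masks that would leave the
    previous frame are skipped.
    """
    height, width = len(previous), len(previous[0])
    ref = [current[top + i][left + j] for i in range(n) for j in range(n)]
    ref_energy = sum(v * v for px in ref for v in px)
    surface = {}
    for dk in range(-radius, radius + 1):
        for dl in range(-radius, radius + 1):
            row, col = top + dk, left + dl
            if row < 0 or col < 0 or row + n > height or col + n > width:
                continue
            test = [previous[row + i][col + j]
                    for i in range(n) for j in range(n)]
            # Assumed form: sum of channel products over both masks,
            # normalized by the energies of the two masks.
            cross = sum(p * q for a, b in zip(ref, test)
                        for p, q in zip(a, b))
            test_energy = sum(v * v for px in test for v in px)
            denom = math.sqrt(ref_energy * test_energy)
            surface[(dk, dl)] = cross / denom if denom > 0 else 0.0
    return surface


def best_match(surface):
    """Offset of the matching mask and its correlation coefficient."""
    offset = max(surface, key=surface.get)
    return offset, surface[offset]
```

The offset returned by `best_match` points from the reference mask to the matching mask in the previous frame, so the motion vector of the block is its negation.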



There are two cases where the NCS fails to indicate clearly the position of the 
matching mask in the previous frame. Figure 1 shows examples of each. Therefore
it is necessary to introduce a limit beyond which the position of the matching 
mask, and therefore the subsequent motion vector, is regarded as unusable. 

A first threshold is set to decide whether or not the reference mask and the 
matching mask are considered similar enough i.e. a good correlation. A maximum 
search correlation coefficient not reaching this threshold means that we cannot 
find a reliable matching mask: the reference mask is predicted to be included in 
a non-rigid object or in an area occluded or revealed by moving objects. This is 
referred to as deformation of the mask. This is the case of the reference mask (c)
in Figure 1, for which no high peak in the NCS can be seen in Figure 3. The low
maximum correlation coefficient indicates a deformation of the masks between 
the previous and the current frame. 

A second threshold is set to prevent difficulties which occur when uniform
zones are present. The correlation coefficients computed on a uniform search
neighborhood give many high values, and picking the maximum value may not
ensure an accurate position of the matching mask. Therefore an NCS comprising
too many good correlations (coefficients over the first threshold) is
considered as a uniform zone. This eliminates the need for further or a priori
computation to ensure that the motion vector is valid, such as the variance
analysis applied by [6]. The periodicity of peaks in the NCS can give an
indication of the texture being uniform. This is the case of the reference
mask (b) in Figure 1, the NCS for which is given in Figure 4.

Fig. 3. NCS for the reference mask (c): bad similarity index, deformed area

For each frame a Pixel Translation and Deformation Grid (PTDG) is composed.
The PTDG provides, for each matching mask of the image, the x-translation, the
y-translation, the maximum correlation coefficient and the number of successful
correlations. The deformed masks (low maximum correlation coefficient, cf. case
(c)) and uniform areas (high number of successful correlations, cf. case (b))
are both rejected from further computations for camera motion.
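The two rejection tests can be sketched as below. This is a minimal illustration, not the authors' implementation: the threshold values and the choice to express the second threshold as a fraction of the search neighborhood are assumptions, since the text gives no numbers.

```python
def classify_ncs(surface, match_threshold=0.9, max_good_fraction=0.2):
    """Label an NCS as 'deformed', 'uniform' or 'valid'.

    surface: mapping from search offset to correlation coefficient.
    match_threshold: first threshold (hypothetical value); the best
        coefficient must reach it for the match to count as reliable.
    max_good_fraction: second threshold (hypothetical value); if more
        than this fraction of the coefficients reach the first
        threshold, the area is treated as uniform.
    """
    values = list(surface.values())
    if max(values) < match_threshold:
        return "deformed"
    good = sum(1 for v in values if v >= match_threshold)
    if good > max_good_fraction * len(values):
        return "uniform"
    return "valid"
```

Only blocks labelled "valid" would contribute a motion vector to the camera motion estimate.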



3.2 Camera Motion Extraction 

As stated previously, a relatively simple three-parameter model is used to
deal with camera motion. The term zoom is used to encompass all Z motion. The Z
rotation (about the optical axis) can easily be extracted from the PTDG, but
is considered unnecessary for the classical scenes we propose to analyse. The
first step is to compute a coarse global Z motion. The equation linking pixel
coordinates and the camera focal length is given in Equation (2), as adopted by
Xiong. The focal length is assumed to be constant across the lens system. We
consider only equal and opposite motion vector pairs from geometrically
opposite segments of the frame not rejected by the PTDG, as these are most
likely to be part of the background. The relative segment length is not altered
by a global translation or rotation; we assume that it is altered only by the
zoom. Using the pairs of motion vectors to calculate the zoom improves the
robustness to moving objects. We compute what we call the relative zoom, the
average of the local zooms, given by:

    x = f_x \frac{X}{Z}, \quad y = f_y \frac{Y}{Z}, \quad \frac{l}{l'} = \frac{f}{f'} = z    (2)

where P = {X,Y,Z} designates the 3D coordinates of the object in a reference
system which has (Oz) along the optical axis, and p = {x,y} designates the
image coordinates of the projection of this object in the image. l and l' are
the segment lengths in the current and in the previous frame respectively, f_x
and f_y are the x and y focal lengths in the current frame, with f_x = f_y = f,
and f' is the focal length in the previous frame. z is the relative zoom.

Fig. 4. NCS for the reference mask (b): uniform area
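A minimal sketch of the relative zoom computation follows, under stated assumptions: each local zoom is taken as the ratio of the segment length in the current frame to that in the previous frame, motion vectors point from the previous to the current position, and the pairing of opposite blocks is done beforehand. The function name and data layout are hypothetical.

```python
import math


def relative_zoom(pairs):
    """Average the local zooms over pairs of opposite blocks.

    pairs: iterable of ((p1, m1), (p2, m2)), where p is a block centre
    (x, y) in the current frame and m is its motion vector (dx, dy)
    from the previous frame to the current frame.
    Returns 1.0 (no zoom) when no usable pair is available.
    """
    zooms = []
    for (p1, m1), (p2, m2) in pairs:
        current_length = math.dist(p1, p2)
        # Positions of the same two blocks in the previous frame.
        q1 = (p1[0] - m1[0], p1[1] - m1[1])
        q2 = (p2[0] - m2[0], p2[1] - m2[1])
        previous_length = math.dist(q1, q2)
        if previous_length > 0:
            zooms.append(current_length / previous_length)
    return sum(zooms) / len(zooms) if zooms else 1.0
```

Because only the distance between the two blocks of a pair is used, a translation common to both blocks cancels out, which is the property the text relies on.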

Once this first estimation is complete, we adjust the PTDG to contain only
the X and Y motion of the pixels. The motion vectors can segment the image
into several groups. If we detect a large and coherent moving group (typically
more than half the image), we assume that this is the moving background due
to camera motion. If another coherent group is detected, we reject the pixels
belonging to this group and then we reiterate the zoom computation to correct
the PTDG. Obviously, if a large rigid object is moving towards or away from the
camera, the analysis of pairs of motion vectors on its area will succeed and
will affect the computation of the zoom, so the correction of the PTDG will be
skewed. However, we have observed that if the size of this rigid object is
smaller than half the image, it is simple to separate from the background.
After the background separation, a zoom computation considering only the
background pixels further improves accuracy.

4 Assessment

The results of this paper are divided into two parts. First, the experimental
results relate to the assessment, both subjective and objective, of the camera
compensation accuracy. Then the preliminary investigation into whether the
camera motion alone holds discriminatory dynamic information is presented.
Features reflecting the camera motion signals, as in Figure 6, are statistically
processed (inverse variance weighted, zero mean) and classified using a
Bayesian-based classifier in the form of a Gaussian Mixture Model (GMM).

4.1 Assessing the Camera Compensation 

The first approach to assessing the camera compensation was based purely on
subjective observation. A range of test scenes was observed to assess the
compensation under fast and slow camera motions in the 3 parameters of the
motion model. The observations were of thresholded differences of adjacent
frames of the original sequences, compared with those differences after camera
compensation. A difficult example of these comparison images is shown in
Figure 5. The image created from the original sequence has motion in most areas
of the image due to both object and camera motion, as seen in Figure 5(a).
Conversely, in Figure 5(b), the compensated sequence shows a large reduction of
the motion in the background, in contrast to the motion around the foreground
objects, which does not show much reduction. With additional morphological
noise filtering the background movement is further reduced, as seen in
Figure 5(c).







Fig. 5. Typical Examples of Camera Compensation (a) Original, (b) Compensated, (c) 
Noise Filtered 



The second approach is an objective measure, using artificial test data. Here
the accuracy in pixels of the predicted camera motion is measured. A test
sequence is manually generated so that the ground truth of the camera motion is
known. Static camera sequences are chosen and the frames are then perturbed
in space through time to simulate camera motion. The sequences include moving
foreground objects to make sure the method is robust under various conditions,
assuming that the dominant motion is the camera’s motion. The controlled test
sequences are then put through the camera compensation algorithm and the
measured camera motion is compared with the ground truth. The average accuracy
of the camera motion across the artificial test data was found to be of the
order of 0.1 of a pixel.



4.2 Classification of Sport Using Camera Motion 

The second part of the results shows the potential of the camera dynamics in
classifying video sequences. The class of sport is chosen because this is predicted
to be relatively different in terms of camera motions and is commonly quoted as
having highly discriminable dynamics.

A total of 285 classifications were made, each on approximately 12 seconds of
video. The database used was made up of 28 scenes, 6 sport and 22 others
including soap, news, cartoon, movies and nature. Each scene in the database
comprises multiple shots and is about 40 seconds long, giving a total of
approximately 18 minutes. The short sequences used for classification are
arbitrarily selected by automatically splitting a scene into 12-second sequences.






Fig. 6. Typical examples of second order camera motion (+10, -10 pixels) plotted
against time (1500 frames). On the left is X motion on the right is Y motion. The 
scenes are (a) Basketball and (b) Soap. 



Two motion signals of the camera, X and Y, for typical scenes can be seen in
Figure 6. These motion signals are statistically modelled for sport and non-sport
for classification. A "round-robin" technique is used to maximise the data usage.
The GMM is trained on all scenes other than the one being tested, i.e. the 27
remaining scenes are split into short sequences of 12 seconds. The test scene is
then changed and the training is done again. The feature extraction is that used
for our holistic region-based approach described in [Q.

Results are split into two: identification and verification. The identification
error for sport and non-sport on short sequences, using dynamics only, was 20%,
based on just 12-second sequences. Verification results were also obtained to
further investigate the performance of the system for different applications.
The system produced a false acceptance rate of 10% when the false rejection
rate was 50%.
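The evaluation protocol can be sketched as follows; this is an illustration under assumptions, not the system's code. The round-robin generator holds out one scene at a time, and the error rates assume one scalar score per sequence (higher meaning more sport-like) compared against a decision threshold. GMM training and scoring are not shown, and all names are hypothetical.

```python
def round_robin(scenes):
    """Yield (training_scenes, test_scene), holding out each scene in turn."""
    for i, test_scene in enumerate(scenes):
        yield scenes[:i] + scenes[i + 1:], test_scene


def error_rates(scores, is_sport, threshold):
    """False acceptance and false rejection rates for a 'sport' claim.

    scores: one classifier score per sequence (higher = more sport-like).
    is_sport: ground-truth flags, True for genuine sport sequences.
    threshold: sequences scoring at or above it are accepted as sport.
    """
    genuine = [s for s, y in zip(scores, is_sport) if y]
    impostor = [s for s, y in zip(scores, is_sport) if not y]
    false_accept = sum(s >= threshold for s in impostor) / len(impostor)
    false_reject = sum(s < threshold for s in genuine) / len(genuine)
    return false_accept, false_reject
```

Sweeping `threshold` trades one error against the other, which is how an operating point such as the one quoted above would be read off.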

5 Conclusion 

The proposed correlation approach is simple but efficient and effective.
The background search allows consideration of all kinds of scenes, and therefore
ensures feasible cooperation with other methods for video content analysis.
Moreover, the PTDG can easily be exploited to give more information about local
features and object annotations. The correlation is a robust estimator of block
matching, is intensity independent, and based on optical flow. We have also
shown that the camera motion itself has discriminatory information for video
classification. The results show that if a search was made for sport, the system
would return four sport sequences for every non-sport sequence. These results
are encouraging because the test sequences were chosen randomly without vetting;
a proportion of them contain very little or no motion, and the approach cannot
perform at all in these conditions as it is purely dynamics-based.



Abstract. Object identification by matching is a central problem in
computer vision. A major problem that any object matching method
must address is the ability to correctly match an object to its model
when parts of the object are missing due to occlusion, shadows, etc.
In this paper we introduce boundary signatures as an extension to our 
surface signature formulation. Boundary signatures are surface feature 
vectors that reflect the probability of occurrence of a feature of a surface 
boundary. We introduce four types of surface boundary signatures that 
are constructed based on local and global geometric shape attributes of 
the boundary. Tests conducted on incomplete object shapes have shown 
that the Distance Boundary Signature produced excellent results when 
the object retains at least 70% of its original shape. 



1 Introduction 

Any reliable object recognition system must be able to analyze the world scene 
that the machine “sees” and correctly recognize all objects appearing in the 
scene. At the core of the object recognition subsystem is the matching process, 
where the system compares the result of its scene analyses to its object database, 
and hypothesizes about objects appearing in the scene. A major concern with 
matching is that most matching techniques fail to arrive at the correct match
when partial information is missing. A typical example is the case of object
occlusion where only part of the object is visible to the camera (or vision sensor). 
Such a case usually results in object mismatch and incorrect object hypotheses 
and hence incorrect recognition. 

In our previous work, we introduced surface signatures as robust feature
descriptors for surface matching [1,2,3]. A surface signature is a feature vector that
reflects the probability of occurrence of a surface feature on a given surface. Sur- 
face signatures are scale and rotation invariant. We showed that by using surface 
signatures correct identification was possible under partial occlusion and shad- 
ows. Previously we employed two types of surface signatures, surface curvature 
signatures and surface spectral signatures, which statistically represent surface 
curvature features and surface color features, respectively. In this paper we in- 
troduce surface boundary signatures as an extension to our surface signature 
formulation.

2 Related Literature

Shape analysis is divided into two categories: boundary-based shape analysis,
which is based on the analysis of the object’s boundary, and interior-based shape
analysis, which is based on the analysis of the object’s interior region. Examples
of boundary-based shape analysis are Chain (or Freeman) codes, Polygonal
Approximations and Shape Signatures. Examples of interior-based shape analysis
include region moment methods and Medial Axis Transform (MAT) methods,
also known as skeleton techniques.

The literature on shape analysis methods is vast. In this section we only
present a sample of recent papers that are relevant to our work. Hong [4]
presented an indexing approach to 2-D object description and recognition that is
invariant to rotation, translation, scale, and partial occlusion. The scheme is
based on three polygonal approximations of object boundaries where local
object structural features (lines and arcs) are extracted. Ozcan and Mohan [5]
presented a computationally efficient approach which utilizes genetic algorithms
and attributed string representation. Attributed strings were used to represent
the outline features of shapes. Roh and Kweon [6] devised a contour shape
signature descriptor for the recognition of planar curved objects in noisy scenes.
The descriptor, consisting of five-point invariants, was used to index a hash
table. Nishida [7] proposed an algorithm for matching and recognition of deformed
closed contours based on structural features. The contours are described by a
few components with rich features. Kovalev and Petrou [8] extracted features
from co-occurrence matrices containing description and representation of some
basic image structures. The extracted features express quantitatively the
relative abundance of some elementary structures. Mokhtarian et al. [9] used the
maxima of curvature zero-crossing contours of the curvature scale space image as a
feature vector to represent the shapes of object boundary contours. For a recent
survey on shape analysis techniques the reader is referred to [10].

3 Boundary Features and Signatures 

We use the surface’s (or object’s) boundary (Ω) to extract local and global
geometric shape attributes that are used to construct four boundary signatures.
The boundary of any surface (or object) consists of a finite number (N) of
points (λ_i) in an ordered sequence that defines the shape of the surface (see
Fig. 1),

    \Omega = \{ \lambda_i = (x_i, y_i), \; i = 0, \ldots, N-1 \}    (1)

Fig. 1. The boundary and its feature vectors.

The following assumptions are made about Ω: Ω is closed (i.e. λ_0 follows
λ_{N-1}), Ω has a single point thickness (i.e. Ω has been thinned), Ω is ordered
in a counter-clockwise sense (the object is to the left) and Ω does not contain
any internal holes.



3.1 The Boundary Intra-distance 

The boundary intra-distance (d_ij) is defined as the Euclidean distance between
a pair of points, λ_i and λ_j, on Ω (see Fig. 1),

    d_{ij} = d(\lambda_i, \lambda_j) = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}    (2)

Plots of the distance matrix (d) for a circle (d_circle) and an ellipse
(d_ellipse) are shown in Fig. 2. For illustrative purposes, each plot is shown
from two viewpoints: a side view and a top view. Because of the symmetry
inherent in the shape of a circle, d_circle has the unique feature of being the
only profile containing diagonal lines with constant values. As a circle is
stretched in a given direction, d_circle loses its symmetry, producing
d_ellipse.



Fig. 2. Plots of d for a circle, d_circle (right), and for an ellipse with
eccentricity 0.995, d_ellipse (left)




When parts of an object boundary are missing, d will also change from its
original profile. However, if the amount of boundary missing is relatively small
then d will not change by much. Let p denote the percentage of boundary missing.
Fig. 3 shows three variations of a circle with different amounts of its original
boundary missing, along with their corresponding d. We see that when the
amount of boundary missing is small (p = 0.1), d_circle_0.9 is very similar to
d_circle. But as the amount of boundary missing increases, the similarity of d to
its original d decreases. Note that even when a large amount of its boundary is
missing, such as the case for the circle with 50% of its original boundary missing,
a large similarity still exists between d_circle_0.5 and d_circle, such that the
shape can be identified as being (part of) a circle. It is this similarity in d and
the other boundary features presented in this paper that we exploit in our work
to arrive at successful matching when partial boundary information is missing.
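As a small illustration, the distance matrix of Equation (2) can be computed directly from an ordered list of boundary points. The function names and the sampled circle used as input are hypothetical.

```python
import math


def intra_distance_matrix(boundary):
    """Matrix of Euclidean distances between all pairs of boundary points.

    boundary: ordered list of (x, y) points along the closed boundary.
    """
    return [[math.dist(a, b) for b in boundary] for a in boundary]


def circle_boundary(n, radius=1.0):
    """n points sampled counter-clockwise on a circle (illustrative input)."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]
```

For a uniformly sampled circle, every entry at a fixed index offset (i, i + k) has the same value, which is the constant-diagonal property noted above.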

3.2 The Boundary Intra-angle 

The boundary intra-angle (α) is defined as the directional angle between
boundary points (see Fig. 1). The boundary intra-angle of point λ_j with respect
to λ_i is given by,

    \alpha_{ij} = \alpha(\lambda_i, \lambda_j) = \angle \mathbf{v}_{ij} = \arctan(y_j - y_i, \; x_j - x_i)    (3)

Values of α lie in the range [0, 2π[. Note that α is skew symmetric, α_ij = −α_ji.
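A possible reading of the two-argument arctangent in Equation (3) is the usual `atan2`, wrapped into the stated range. The sketch below makes that assumption explicit; the function name is hypothetical.

```python
import math


def intra_angle(p_i, p_j):
    """Direction of point p_j as seen from point p_i, in radians.

    Assumes the two-argument arctangent of Equation (3) is atan2,
    with the result wrapped from (-pi, pi] into [0, 2*pi).
    """
    angle = math.atan2(p_j[1] - p_i[1], p_j[0] - p_i[0])
    return angle % (2 * math.pi)
```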



3.3 The Boundary Curvature Angle 

The boundary curvature angle (κ°) represents the amount of local boundary
bending as measured by the local boundary curvature angle (see Fig. 1). κ° at
point λ_i is computed by,

    \kappa^\circ = \cos^{-1}\left( \mathbf{v}_{i-1} \cdot (-\mathbf{v}_i) \right) \cdot \mathrm{sign}\left( \mathbf{v}_{i-1} \times \mathbf{v}_i \right)    (4)

where the sign function is defined as sign(x) = −1 if x < 0, 1 otherwise, and
v_i is the unit direction vector from point λ_i to λ_{i+1}. Positive values of
κ° in the range 0 < κ° < π indicate shape concavities, while negative values of
κ° in the range −π < κ° < 0 indicate shape convexities. For example, a circle
has values of κ° = π everywhere. The profile of an ellipse is similar to that of
a circle, but has smaller values of κ° at the ends of its major axis (dependent
on eccentricity and profile resolution).
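Equation (4) can be evaluated point by point as sketched below. The sketch follows the formula as printed and assumes that the cross product is the scalar z-component of the 2-D vectors; which sign corresponds to a concavity therefore depends on that orientation assumption. The function name is hypothetical.

```python
import math


def curvature_angles(boundary):
    """Signed curvature angle at every point of a closed boundary.

    boundary: ordered list of distinct (x, y) points; the last point is
    followed by the first. The cross product is taken as the scalar
    v_prev.x * v_next.y - v_prev.y * v_next.x (an assumption).
    """
    n = len(boundary)

    def unit(a, b):
        dx, dy = b[0] - a[0], b[1] - a[1]
        norm = math.hypot(dx, dy)
        return dx / norm, dy / norm

    angles = []
    for i in range(n):
        v_prev = unit(boundary[i - 1], boundary[i])
        v_next = unit(boundary[i], boundary[(i + 1) % n])
        dot = -(v_prev[0] * v_next[0] + v_prev[1] * v_next[1])
        dot = max(-1.0, min(1.0, dot))  # guard acos against rounding
        cross = v_prev[0] * v_next[1] - v_prev[1] * v_next[0]
        sign = -1.0 if cross < 0 else 1.0
        angles.append(math.acos(dot) * sign)
    return angles
```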





Fig. 3. Plots of the distance matrix (d) for a circle with different amounts of
its border missing (from top to bottom): d_circle_0.9 (p = 0.1), d_circle_0.7
(p = 0.3) and d_circle_0.5 (p = 0.5).



3.4 Chord Bending 

Chord bending (γ) represents the amount of parameter bending between two
points on Ω (see Fig. 1). Chord bending between two points, λ_i, λ_j ∈ Ω, is
defined as the ratio of the Euclidean distance between the two points (d_ij) to
the parametric distance between the two points (p_ij),

    \gamma_{ij} = d_{ij} / p_{ij}    (5)

Since Ω is closed, two different values of p exist between any pair of points,
one in the counter-clockwise direction and the other in the clockwise direction.
We define the parameter distance between two points as the shorter of the two
distances. A plot of the chord bending matrix (γ) for a circle and an ellipse is
shown in Fig. 4. Note that the profile of an ellipse is distinctly different from
that of a circle.
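A minimal sketch of the chord bending matrix follows, assuming that the parametric distance is the arc length measured along the polygonal boundary and that the undefined diagonal entries (zero distance over zero arc length) are simply left at 0. The function name is hypothetical.

```python
import math


def chord_bending_matrix(boundary):
    """Ratio of chord length to the shorter boundary arc for all point pairs.

    boundary: ordered list of (x, y) points of a closed boundary.
    Diagonal entries are undefined and are left at 0.0 here.
    """
    n = len(boundary)
    segments = [math.dist(boundary[k], boundary[(k + 1) % n])
                for k in range(n)]
    perimeter = sum(segments)
    # Arc length from point 0 to point k, following the point order.
    cumulative = [0.0]
    for length in segments[:-1]:
        cumulative.append(cumulative[-1] + length)
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            arc = abs(cumulative[j] - cumulative[i])
            shortest = min(arc, perimeter - arc)
            gamma[i][j] = math.dist(boundary[i], boundary[j]) / shortest
    return gamma
```

Values close to 1 indicate a nearly straight stretch of boundary between the two points, while small values indicate strong bending.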




Fig. 4. Plot of the chord bending matrix (γ) for a circle (left) and an ellipse (right).



3.5 Boundary Signatures 

A surface signature is a feature vector that reflects the probability of occurrence
of the feature for a given surface. If S denotes a surface signature of size N, then
by definition, Σ S_i = 1 with 0 ≤ S_i ≤ 1, where 0 ≤ i ≤ N − 1 [1].

Given the boundary matrices d, α, κ° and γ, surface signatures can be
constructed for these matrices. The Distance Boundary Signature (S_DB)
represents the frequency of occurrence of the normalized intra-distance between
two points for a given intra-distance value. Recall that d(λ_i, λ_j) represents
the shape’s intra-distance function between two points, λ_i, λ_j ∈ Ω.
Normalizing the distance function with respect to the maximum distance produces
the normalized distance function, \hat{d}(λ_i, λ_j),

    \hat{d}(\lambda_i, \lambda_j) = \frac{d(\lambda_i, \lambda_j)}{\max(d(\lambda_i, \lambda_j))}    (6)

As a result, 0 ≤ \hat{d} ≤ 1. The advantage of normalizing the intra-distance is
that it produces a metric that is scale invariant. Let Ψ_d denote the inverse
intra-distance function of \hat{d}, i.e.

    \Psi_d(u) = n(\hat{d}(\lambda_i, \lambda_j))    (7)

where u = \hat{d}. Ψ_d is defined only for u ∈ [0,1]. Let \hat{Ψ}_d denote the
normalized inverse intra-distance function, obtained by normalizing the inverse
intra-distance function with respect to the maximum inverse intra-distance value,

    \hat{\Psi}_d(u) = \frac{\Psi_d(u)}{\max(\Psi_d(u))}    (8)








A.A.Y. Mustafa 



Values of \hat{Ψ}_d fall between [0,1]. Taking a closer look at u, we see that
it is a continuous random variable representing the normalized intra-distance
between two points on Ω. Further examination of the normalized inverse
intra-distance function, \hat{Ψ}_d, reveals that it is actually the cumulative
distribution function (cdf) of the random variable u,

    \mathrm{cdf}_d(u) = \hat{\Psi}_d(u)    (9)

Hence, the probability (P_d) that the normalized distance between two points
lies in the range [u_1, u_2] is

    P_d(u_1 < u < u_2) = \int_{u_1}^{u_2} \mathrm{pdf}_d(u)\,du = \mathrm{cdf}_d(u_2) - \mathrm{cdf}_d(u_1) = \hat{\Psi}_d(u_2) - \hat{\Psi}_d(u_1)    (10)

As stated earlier, S_DB represents the frequency of occurrence of the normalized
Euclidean distance between two points for a given distance, i.e. S_DB is the
pdf of u,

    S_{DB} = \mathrm{pdf}_d(u) = \frac{d\hat{\Psi}_d(u)}{du}    (11)



Similar analysis leads to the construction of the other boundary signatures. The
Angle (or Direction) Boundary Signature (S_AB) represents the frequency of
occurrence of two boundary points oriented relative to each other at a given angle.
The Parameter Boundary Signature (S_PB) represents the frequency of occurrence
of the chord bending between two points at a given value. The Curvature
Boundary Signature (S_CB) represents the frequency of occurrence of the
curvature of a boundary point at a given curvature angle.
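In discrete form, a signature of this kind can be approximated by a normalized histogram. The sketch below does this for the distance signature; the number of bins and the exclusion of the zero distances between a point and itself are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def distance_boundary_signature(boundary, bins=10):
    """Histogram estimate of the pdf of the normalized intra-distance.

    boundary: ordered list of (x, y) points (at least two distinct).
    bins: number of histogram bins over [0, 1] (hypothetical choice).
    Returns a list of bin probabilities that sums to 1.
    """
    n = len(boundary)
    distances = [math.dist(boundary[i], boundary[j])
                 for i in range(n) for j in range(n) if i != j]
    d_max = max(distances)
    counts = [0] * bins
    for d in distances:
        u = d / d_max  # normalized distance, 0 < u <= 1
        counts[min(int(u * bins), bins - 1)] += 1
    total = len(distances)
    return [c / total for c in counts]
```

Because the distances are divided by their maximum before binning, scaling the boundary leaves the signature unchanged, which is the scale invariance claimed for the normalized distance.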



3.6 Measuring Signature Matching Performance 

Matching observed objects to model objects is accomplished by comparing their 
signatures using the four error metrics described in 0. These metrics compare 
the signature profiles based on signature distance, variance, spread and corre- 
lation. The four error metrics are then combined to form the signature match 
error (E), which gives a weighted signature error based on the four error met- 
rics. We will refer to the correct model that an object should match to as the 
model-match. 



Signature Recognition Rate and Recognition Efficiency. We define the
signature recognition rate ($\Phi$) as the percentage of correct hypotheses found for
a given set using a particular signature type. The signature recognition efficiency
($\eta$) is defined as the efficiency of a signature in matching a surface to its
model-match surface for a given surface and is calculated by

$$\eta = (1 - \mathrm{AMCI}) \cdot 100\% \qquad (12)$$

where AMCI is the average model-match percentile of a set.
Significance of Performance Measures. $\Phi$ and $\eta$ complement each other in
measuring the signature matching performance for a given set of surfaces. While
$\Phi$ measures the percentage of surfaces that correctly produced a hypothesis, $\eta$
measures the efficiency of matching by taking into consideration how “far off”,
in terms of matching rank, the model-match is from the hypotheses found.
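The two measures can be sketched in a few lines. The paper does not spell out how the model-match percentile is computed, so the sketch assumes it is the rank of the model-match scaled to $[0,1]$ (0 for the best rank, 1 for the worst); the function names and the sample ranks are hypothetical.

```python
def recognition_rate(ranks):
    """Phi: percentage of test objects whose model-match is ranked first."""
    return 100.0 * sum(1 for r in ranks if r == 1) / len(ranks)

def recognition_efficiency(ranks, n_models):
    """eta = (1 - AMCI) * 100, with AMCI the mean model-match percentile."""
    # Assumed percentile: 0 when the model-match is ranked first,
    # 1 when it is ranked last among n_models candidates.
    percentiles = [(r - 1) / (n_models - 1) for r in ranks]
    amci = sum(percentiles) / len(percentiles)
    return (1.0 - amci) * 100.0

# Hypothetical ranks of the model-match for five test objects, 41 models.
ranks = [1, 1, 2, 1, 5]
phi = recognition_rate(ranks)
eta = recognition_efficiency(ranks, n_models=41)
```

Under these assumptions three of five objects are recognized outright, yet the efficiency stays high because the two misses are still ranked near the top.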



4 Experimental Results 

Tests were conducted on objects of different shapes and sizes. The model set 
consisted of the 41 objects shown in Fig. 5. Shape signature files for these models 
were constructed off-line and stored in the model database. Testing was done on
two sets: the first set consisted of objects that are completely visible, while the
second set consisted of incomplete objects. A sample of the test objects is shown
in Fig. 6. Boundary signatures for a sample model are shown in Fig. 7. 









Fig. 5. Database models. 



Fig. 6. Sample objects. 



4.1 Matching Complete Objects 

Testing was done on 129 instances of the models at random scales and rotations. 
In addition 6 instances of 2 new models not in the database were also tested to 
observe if the system can indeed identify the closest model match to that which 
is perceived visually. 

Boundary signatures for these objects were constructed and subsequently 
matched to the model database. The results of matching are discussed in El 
and a summary follows: 

• Matching using $S_{DB}$ produced the best signature matching, where 124 of the
135 objects were correctly hypothesized ($\Phi_{DB} = 91.9\%$). Matching using $S_{PB}$
produced the second best matching rate ($\Phi_{PB} = 63.7\%$). Matching using $S_{CB}$
and $S_{AB}$ produced the worst results ($\Phi_{CB} = 15.6\%$ and $\Phi_{AB} = 8.1\%$).

• The signature efficiencies using $S_{DB}$ and $S_{PB}$ are very high ($\eta_{DB} = 99.1\%$
and $\eta_{PB} = 91.1\%$). This indicates that using $S_{DB}$ produced, on average,
the model-match as the best match or the next best match, with a higher
tendency for the former. The signature efficiencies using $S_{AB}$ and $S_{CB}$ are very
poor ($\eta < 60\%$).

Fig. 7. Boundary signatures for model C. Top to bottom and left to right: $S_{AB}$, $S_{DB}$,
$S_{PB}$, $S_{CB}$.



4.2 Matching Incomplete Objects 

In this section we present the results of matching incomplete objects. Forty-one
incomplete objects were tested. For these objects $\mu$ varied from $\mu = 0.12$ to
$\mu = 0.61$ with an average value of $\mu = 0.37$ (recall that $\mu$ denotes the amount of
data missing). Fig. 8 shows a sample of these shapes. Matching results for these
objects were as follows:

• The overall success rates of the signatures were all very weak, with $\Phi < 20\%$
for all four signatures. Matching using $S_{DB}$ produced the highest success rate
($\Phi_{DB} = 19.5\%$) followed by $S_{CB}$ ($\Phi_{CB} = 14.6\%$). Matching using either
$S_{AB}$ or $S_{PB}$ produced the poorest results ($\Phi_{AB} = 2.4\%$ and $\Phi_{PB} = 7.3\%$). The
performance of $S_{DB}$ and $S_{CB}$ is much better when $\mu < 0.3$ ($\Phi_{DB,\mu<0.3} = 66.7\%$
and $\Phi_{CB,\mu<0.3} = 16.6\%$).

• However, the signature efficiencies varied from $\eta = 53.6\%$ to $\eta = 78.4\%$.
$S_{PB}$ had the best performance with $\eta_{PB} = 78.4\%$ followed by $S_{DB}$ with
$\eta_{DB} = 76.0\%$. $S_{CB}$ and $S_{AB}$ both performed poorly with $\eta = 65.9\%$
and $\eta = 53.6\%$, respectively.

• When $\mu < 0.3$ the signature efficiencies for both $S_{DB}$ and $S_{CB}$ improve
considerably ($\eta_{DB} = 92.3\%$ and $\eta_{CB} = 77.7\%$).

• From the points mentioned above, it is clear that a correlation exists between
$\mu$ and MCI (model-match percentile) for $S_{DB}$ and $S_{PB}$. The correlation
coefficients calculated for the four signatures are 0.110, 0.764, 0.647 and 0.120
for $S_{AB}$, $S_{DB}$, $S_{PB}$ and $S_{CB}$, respectively. These correlation coefficients
indicate, as observed earlier, that a strong correlation exists between the amount
of boundary missing and the model-match percentile for $S_{DB}$ and $S_{PB}$, and
essentially no correlation exists between the amount of boundary missing and the
model-match percentile for $S_{AB}$ and $S_{CB}$.
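The paper does not say which correlation coefficient was used; the sketch below assumes the ordinary Pearson coefficient between the fraction of boundary missing and the model-match percentile. The data values are hypothetical and chosen only to exercise the function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: missing-boundary fraction and model-match percentile.
mu = [0.1, 0.2, 0.3, 0.4, 0.5]
mci = [0.02, 0.04, 0.06, 0.08, 0.10]
r = pearson(mu, mci)
```

A value near 1 would indicate that the model-match slips down the ranking as more of the boundary goes missing; a value near 0 would indicate that the ranking is unrelated to the amount missing.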



Fig. 8. A sample of the incomplete objects.



5 Discussion and Conclusion 

Table 1 gives a comparison of $\eta$ for the four signatures as a function of $\mu$:

• The signature efficiencies ($\eta$) for both $S_{DB}$ and $S_{PB}$ are sensitive to $\mu$, as they
degrade with increasing $\mu$, while $\eta$ for $S_{AB}$ and $S_{CB}$ is not sensitive to $\mu$.

• Up to $\mu = 0.3$, $S_{DB}$ outperforms all other boundary signatures.

• Up to $\mu = 0.6$, the signature of choice is either $S_{DB}$ or $S_{PB}$.

The fact that $\eta$ for both $S_{DB}$ and $S_{PB}$ is sensitive to $\mu$, while $\eta$ for $S_{AB}$ and $S_{CB}$
is not, is due to the former pair of signatures being dependent on the distance
between boundary points. When an object is incomplete, the object takes on a new
shape defined as the boundary of the visible part of the object. The boundary of
the visible part has smaller inter-distances between its boundary points. As the
amount of boundary missing ($\mu$) increases, the distortion in distance increases.
On the other hand, $S_{CB}$ is a function of the boundary curvature, which retains its
correct values on the visible portion of the original boundary. $S_{AB}$ is a function
of the inter-direction between boundary points. As long as only a non-significant
part of the object is missing, the inter-direction between boundary points is not
greatly distorted.

Table 1. Comparison of $\eta$

$\mu$              $S_{AB}$   $S_{CB}$   $S_{DB}$   $S_{PB}$
$\mu = 0$          52.8%      59.9%      99.1%      91.1%
$0 < \mu < 0.3$    58.2%      77.7%      92.3%      76.8%
$0 < \mu < 0.7$    53.6%      65.9%      76.0%      78.4%

The high success rates obtained using the boundary signatures with the various
shapes described above make the use of signature matching acceptable for
character recognition. Furthermore, the boundary signature error described in
this paper can be combined with our previous work to define a surface match
error based on curvature, spectral and boundary attributes. With this added
feature, the excellent results previously obtained using curvature and boundary
attributes can be further enhanced.
Abstract. In this paper we propose a method for measuring the similarity
between two images inspired by the notion of Hausdorff distance.
Given two images, the method checks pixelwise if the grey values of one
are contained in an appropriate interval around the corresponding grey
values of the other. Under certain assumptions, this provides a tight
bound on the directed Hausdorff distance of the two grey-level surfaces.
The proposed technique can be seen as an equivalent in the grey level
case of a matching method developed for the binary case by Huttenlocher
et al. [2]. The method fits naturally an implementation based on
comparison of data structures and requires no numerical computations
whatsoever. Moreover, it is able to match images successfully in the
presence of severe occlusions. The range of possible applications is vast; we
present preliminary, very good results on stereo and motion correspondence
and iconic indexing in real images, with and without occlusion.



1 Introduction 

This paper presents a general-purpose technique for matching arbitrary grey- 
level images, built around the concept of Hausdorff distance. The Hausdorff 
distance provides a useful measure for matching two sets of points, and its ver- 
satility is suggested by the diverse applications in which it appears, including 
defect detection P, gesture recognition P, robot localization P, range image 
analysis P, and content-based video and database indexing. 

Hausdorff measures have been used in computer vision nearly exclusively to
match binary patterns of contours or edges. The work by Huttenlocher et al. [2]
is the most representative here, and an apt springboard to introduce the main
points and innovations of our technique. Huttenlocher et al. used the Hausdorff
distance to implement an efficient search of a binary edge model $M$ in a binary
edge image $I$. They fixed a (small) threshold $\rho$, say $\rho = 1$ pixel, and a certain
fraction $f$ of the model edge points, say $f = 90\%$, and considered all possible
translated versions $M_t$ of $M$ over $I$. Then, they built a dilated version of $I$, $I_\rho$,
by setting equal to 1 all pixels within $\rho$ from each edge location in $I$. Candidate
matches were indicated by translations, if any, resulting in $M_t$ with at least a
fraction $f$ of the edge points contained in the corresponding region of the dilated
edge image $I_\rho$. A problem of this method is that, if the edge density in $I$ is high,
the likelihood of a successful match against any $M_t$ is also high. To obviate this,
the same procedure was applied a second time after swapping image and model
(similar to the left-right consistency constraint of stereo matching). The authors
also proposed extensions to allow for rotations and scaling of the model with
respect to the image.

Our method, which includes Huttenlocher’s in the special case of binary im- 
ages, is a true grey-level matching technique. It achieves tolerance to occlusions 
and significant versatility. No explicit numerical computations are necessary, so 
that the method is also suitable for high-speed implementations. Finally, using 
the rich information of the grey levels makes the disambiguating second step 
(matching after swapping image and model) largely redundant. 

The structure of this paper is as follows. Section 2 presents the necessary
theory, Section 3 the method proposed, and Section 4 experimental results with
various applications. A discussion of our work is given in Section 5.

2 Theoretical Background 

In this section we recall the main mathematical concepts behind our work, and
discuss some geometric properties of Hausdorff distances. For simplicity, we
restrict attention to the case of $\mathbb{R}^2$.

2.1 Hausdorff Distances 

Given two finite point sets $A$ and $B$, the directed Hausdorff distance,
$h(A,B)$, is measured in two steps. For a fixed point $a$ of $A$, the first step computes
the distance of $a$ from each point $b$ of $B$, and selects the distance between $a$ and
the closest point of $B$, $d_a$ (see Figure 1, left). The second step takes the maximum
of $d_a$ for all $a$ of $A$, $h(A,B)$ (see Figure 1, center).

Formally this can be written as

$$h(A,B) = \max_{a \in A} \min_{b \in B} \| a - b \| . \qquad (1)$$

Note that order matters, since in general $h(A,B) \neq h(B,A)$. In the case of
Figure 1, for example, $h(B,A) \ll h(A,B)$, because all points of $B$ are “close” to
some points of $A$. The directed Hausdorff distance, which is not symmetric and
thus not a “true” distance, measures the degree of mismatch between $A$ and $B$.
To obtain a distance in the mathematical sense, symmetry can be restored by
taking the maximum between $h(A,B)$ and $h(B,A)$. This leads to the definition
of the Hausdorff distance, that is,

$$H(A,B) = \max(h(A,B), h(B,A)) . \qquad (2)$$
Fig. 1. Directed Hausdorff distance. Let the empty and filled dots be the elements of
the sets $A$ and $B$ respectively. The distance of a point $a$ of $A$ from the set $B$ is denoted
by $d_a$ (left); the maximum of $d_a$ over all points of $A$ is the directed Hausdorff distance
and is denoted by $h(A,B)$ (center). The presence of a single outlier in $A$ is sufficient
to distort $h(A,B)$ significantly (right).



Being a distance in the mathematical sense, the Hausdorff distance is zero if and 
only if A = B. Instead, the directed Hausdorff distance is zero if and only if A 
is a subset of B. For our purposes, a useful property of both measures is their 
ability to measure the distance between two sets with different number of points, 
while an undesirable property is their sensitivity to outliers (see Figure 1, right).
The next section shows how this can be countered effectively. 
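The two definitions translate directly into a brute-force computation. The sketch below is a minimal illustration for small point sets, not an efficient implementation; the function names and the sample points are hypothetical.

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): largest distance from a point of A to its closest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B): symmetric Hausdorff distance."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Hypothetical point sets; (5, 0) plays the role of a single outlier in A.
A = [(0, 0), (1, 0), (5, 0)]
B = [(0, 0), (1, 0)]

h_ab = directed_hausdorff(A, B)   # dominated by the outlier
h_ba = directed_hausdorff(B, A)   # zero, since B is a subset of A
H = hausdorff(A, B)
```

The example reproduces both properties noted in the text: the measure is not symmetric, and one outlier in $A$ is enough to determine $h(A,B)$.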



2.2 Geometric Interpretation 

One way to gain intuition on Hausdorff measures is to think in terms of set
inclusion, see Figure 2 (left). Let $B_\rho$ be the set obtained by replacing each point
of $B$ with a disk of radius $\rho$, and taking the union of all of these disks. Effectively,
$B_\rho$ is obtained by dilating $B$ by $\rho$. The directed Hausdorff distance $h(A,B)$ is
not greater than $\rho$ if and only if $A \subseteq B_\rho$. This follows easily from the fact that,
in order for every point of $A$ to be within distance $\rho$ from some point of $B$, $A$
must be contained in $B_\rho$.

This geometric interpretation suggests an interesting property of the directed
Hausdorff distance, useful to counter the effect of outliers. Let us call $\tilde{A}$ the subset
of the points of $A$ contained in $B_\rho$. Assume that, for a given value of $\rho$, $\tilde{A}$ is
nearly equal to $A$ (in Figure 2, right, there is only one element of difference).
While the directed Hausdorff distance between $A$ and $B$ can be distorted by the
few points not in $\tilde{A}$, $h(\tilde{A},B)$ is still not greater than $\rho$, which means that the
potential outliers are defined and identified in one step.
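Partial inclusion can be sketched as a filter on $A$: keep the points that fall inside the dilated set and report what fraction they represent. The function name `inliers`, the closed-disk convention (distance at most $\rho$) and the sample points are assumptions made for illustration.

```python
import math

def inliers(A, B, rho):
    """Points of A lying in B dilated by rho, i.e. within rho of some point of B."""
    return [a for a in A if min(math.dist(a, b) for b in B) <= rho]

# Hypothetical point sets; (5, 0) is far from every point of B.
A = [(0, 0), (1, 0.5), (5, 0)]
B = [(0, 0), (1, 0)]

A_in = inliers(A, B, rho=1.0)
fraction = len(A_in) / len(A)   # share of A explained by B at tolerance rho
```

The points left out of `A_in` are exactly the candidate outliers, which is the sense in which they are defined and identified in a single step.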



3 The Method 

Fig. 2. Geometric interpretation of the directed Hausdorff distance. Let again the
empty and filled dots be the elements of the sets $A$ and $B$ respectively. Let $B_\rho$ be the
union of the set of disks of radius $\rho$ centered at the points of $B$. From the left figure it
is clear that $A \subseteq B_\rho$ if $\rho \ge h(A,B)$ and vice versa. The right figure shows an example
of how this property, through the concept of partial inclusion, can be used to overcome
the sensitivity to outliers.

In this section we present the method for finding a match of a grey level model
$M$ in a grey level image $I$, which can be classified among correlation techniques.
The idea is to search for a match by comparing all possible translated versions
$M_t$ of $M$ with $I$ and selecting the most appropriate translation $t$, if any. The
method consists of three steps.



1. Expand the model $M$ into the 3D binary matrix $\mathcal{M}$, the third dimension
being the grey value. That is, for $i$ and $j$ spanning the pixel locations and $g$
the grey values:

$$\mathcal{M}(i,j,g) = \begin{cases} 1 & \text{if } M(i,j) = g \\ 0 & \text{otherwise.} \end{cases}$$

Build the 3D binary matrix $\mathcal{I}$ from the image $I$ in the same way.

2. Dilate the matrix $\mathcal{I}$ by growing its nonzero entries by a fixed amount in all
three dimensions. Let $\mathcal{I}'$ be the resulting 3D binary matrix.

3. Finally, compute the size of the intersection between all possible translated
versions $\mathcal{M}_t$ and $\mathcal{I}'$, call it $S(t)$. The candidate matches are identified by
high values of $S(t)$, which can be thought of as a similarity surface.



In our examples, we simply assign the best match to the translation $\bar{t}$ for
which $S(\bar{t}) \ge S(t)$ for all $t$, that is to the absolute maximum of $S(t)$, if the
maximum is above a threshold $\tau$, which can be chosen in accordance with the
level of similarity required. As shown in the next section, even with this simplistic
choice results are very good.
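The three steps and the selection rule can be sketched with sets of $(i, j, g)$ triples standing in for the 3D binary matrices. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, the dilation uses a box neighbourhood, and the spatial and grey-level radii are kept as separate parameters (an assumption made so that the toy example has a single peak; setting them equal gives a fixed amount in all three dimensions).

```python
def expand(img):
    """Step 1: nonzero entries of the 3D binary matrix as (i, j, g) triples."""
    return {(i, j, g) for i, row in enumerate(img) for j, g in enumerate(row)}

def dilate(voxels, r_space, r_grey):
    """Step 2: grow every nonzero entry by a box of the given radii."""
    offsets = [(di, dj, dg)
               for di in range(-r_space, r_space + 1)
               for dj in range(-r_space, r_space + 1)
               for dg in range(-r_grey, r_grey + 1)]
    return {(i + di, j + dj, g + dg)
            for (i, j, g) in voxels for (di, dj, dg) in offsets}

def similarity_surface(model, image, r_space, r_grey):
    """Step 3: S(t) = number of translated model entries inside the dilated image."""
    m_vox = expand(model)
    i_dil = dilate(expand(image), r_space, r_grey)
    rows = len(image) - len(model) + 1
    cols = len(image[0]) - len(model[0]) + 1
    return {(ti, tj): sum((i + ti, j + tj, g) in i_dil for (i, j, g) in m_vox)
            for ti in range(rows) for tj in range(cols)}

def best_match(surface, tau):
    """Translation at the absolute maximum of S, if that maximum exceeds tau."""
    t, score = max(surface.items(), key=lambda kv: kv[1])
    return t if score > tau else None

# Hypothetical 4x4 image and a 2x2 model cut from its centre, with grey
# levels perturbed by at most one unit.
image = [[10, 10, 10, 10],
         [10, 50, 60, 10],
         [10, 70, 80, 10],
         [10, 10, 10, 10]]
model = [[51, 59],
         [70, 81]]

S = similarity_surface(model, image, r_space=0, r_grey=1)
t_best = best_match(S, tau=3)
```

Each membership test is a lookup rather than an arithmetic comparison of grey values, which mirrors the remark that the method reduces to logical operations on suitable data structures. With a nonzero spatial radius, translations within that radius of the true one would score equally in this sketch.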

Three remarks are in order. First, the link between this method and the
directed Hausdorff distance is as follows. If (a) the dilation of the matrix $\mathcal{I}$ is
isotropic in an appropriate metric, and (b) $S(t)$ takes on the maximum possible
value (that is, $\mathcal{M}_t \subseteq \mathcal{I}'$), then the directed Hausdorff distance between
$\mathcal{M}_t$ and $\mathcal{I}'$, $h(\mathcal{M}_t, \mathcal{I}')$, is not greater than $\rho$.

Second, with an appropriate choice of the data structures, the algorithm is
reduced to a set of entry-wise logical AND operations between $\mathcal{M}_t$ and $\mathcal{I}'$, and no
numerical computations are required.

Third and finally, unlike the 2D binary image case, the reverse matching with 
swapped patches is not necessary to eliminate possible spurious matches in the 
grey-level case, since the dense grey-level patches contain enough information to 
make the directed Hausdorff distance unambiguous. 

4 Experimental Results 

In this section we show some applications of this method to stereo and motion 
correspondence, and object search. 

4.1 Stereo and Motion 

The proposed method can be used to find correspondences in stereo pairs and
motion sequences. Figure 3 shows a well-known stereo pair with its disparity
map. To compute the disparity map of the left image, at each iteration the
model $M$ is a neighborhood of a pixel of the left image, and the image $I$ is a
subset of the right image, selected with the help of the epipolar constraint. To
obtain the right disparity map, it suffices to swap left and right in the above
description. A left-right consistency check is performed, followed by a
post-processing step to fill the holes caused by multiple matchings and occlusions.




Fig. 3. A stereo pair with its disparity map. 



We have also successfully used our matching technique as a feature tracker.
Features are extracted from the first frame. Each feature can be seen
as a model, and tracked along the sequence by applying the matching method
to each frame. Figure 4 shows an example in which both the camera and the
subject are moving. In the top row a region of the background, taken from
frame 0 as model, is tracked for more than 40 frames (here we show frames 0,
20, and 40). In the bottom row the head of a walking person is tracked for 15
frames (the figure shows frames 0, 5, and 12); the target is lost at frame 14,
when the subject walks into an area of shadow. Figure 5 shows an example
where the image patch selected as a model in the first frame of a sequence is
successfully tracked across the sequence, despite the appearance changes. The
figure shows frames 0 (containing the model), 20, 40, 60, 70, and 85 (the last frame
of the sequence). The camera rotates around the statue, inducing a small image
deformation (therefore, a dilation in all the three directions has been applied).
The binary maps represent the distribution of positive scores (white pixels mark
the inclusion of $\mathcal{M}$ in $\mathcal{I}'$). In the whole sequence (86 frames) there has been
only one error, which occurred at frame 70 (see figure), because its background
was too different from the real model (the binary map on the left represents the
error, the one on the right a possible satisfactory matching).

Fig. 4. Stable region tracking of background (top row) and foreground motion (bottom
row).



4.2 Iconic Search 

In this range of applications we are interested in searching for arbitrary models
inside image databases. The model can be a region of interest of a specific image; for
instance, a nose in a given image of a database of faces can be used to locate
all the noses in the database. Here, the objects imaged are indeed human faces
from the Olivetti face database (a link to the Olivetti face database can be found
at the Computer Vision Home Page). Figure 6 shows an example. The pattern to be
searched is a window around the subject’s right eye in the top row (left image);
the center image shows the location of the best match found by applying the
search on the same image from which the model has been extracted. The map
at the right represents an (equalized) grey-level description of the similarity
values $S(t)$; each map pixel represents an image pixel: the whiter the map pixel,
the higher the value of the surface $S$, and the closer the neighbourhood
of the image pixel is to the model. The other two sections of Figure 6 show results
of the eye search on various images of the face database. High surface values are
scored systematically around the eyes. The absolute maximum misses the eyes
frequently when the subject wears spectacles, which is quite reasonable, since
the presence of glasses modifies the eye pattern.

Fig. 5. Sparse samples of a sequence with binary maps describing the pixel positive
scores of the obtained matches.

Figures 7 and 8 show occlusion experiments. Tolerance to occlusions is
guaranteed by the tolerance to outliers previously discussed, which is related to the
idea of partial inclusion. Figure 7 shows the image from which the model has
been extracted and matching results with a different face image for increasing
levels of occlusion. Notice that changing the shape of the occlusion changes the
results, and the “breaking point” is also different. This is because the results
strongly depend on the amount of information contained in the non-occluded
area. In Figure 7, for instance, a tiny model of about 100 pixels is enough to
obtain good results, as long as the model contains the fold of a nostril.

Figure 8 shows a further example of how the “breaking point” changes with
differently shaped occlusions. Of course, in practical applications it is more
common to have occlusions in the image rather than in the model. Our choice of
occluding parts of the model, though, does not affect either the method or the
results, but simplifies the synthetic manipulation of the occlusions.

5 Discussion 

We have presented a method for measuring the similarity between two images 
inspired by the notion of Hausdorff distance. With an appropriate choice of 
the data structures, the algorithm can be implemented very efficiently. Reverse 
matching, necessary to disambiguate in the binary case, is redundant here. 
Fig. 6. Eye search in a database of faces (see text).



The excellent potential of the method has been illustrated in a number of 
experiments, even though the current selection of the best match on the sim- 
ilarity surface exploits only a fraction of the information actually provided by 
the surface itself. A better use of this information is the subject of current work 
along with a more systematic performance evaluation. 
Abstract. The matching of hierarchical relational structures is of
significant interest in computer vision and pattern recognition. We have
recently introduced a new solution to this problem, based on a maximum
clique formulation in a (derived) “association graph.” This allows
us to exploit the full arsenal of clique finding algorithms developed in the
algorithms community. However, thus far we have focussed on one-to-one
correspondences (isomorphisms), and many-to-one correspondences
(homomorphisms). In this paper we present a general solution for the case
of many-to-many correspondences (morphisms) which is of particular
interest when the underlying trees reflect real-world data and are likely to
contain structural alterations. We define a notion of an $\varepsilon$-morphism
between attributed trees, and provide a method of constructing a weighted
association graph where maximal weight cliques are in one-to-one
correspondence with maximal similarity subtree morphisms. We then solve the
problem by using replicator dynamical systems from evolutionary game
theory. We illustrate the power of the approach by matching articulated
and deformed shapes described by shock trees.



1 Introduction 

The matching of relational structures is a classic problem in computer vision 
and pattern recognition, instances of which arise in areas as diverse as object 
recognition, motion and stereo analysis (see, e.g., [3]). A well-known approach
to solving this problem consists of transforming it into the equivalent problem 
of finding a maximum clique in an auxiliary graph structure, known as the 
association graph [2]. The idea goes back to Ambler et al. and has since
been successfully employed in a variety of different problems. This framework
is attractive because it casts relational structure matching as a pure graph- 
theoretic problem, for which a solid theory and powerful algorithms have been 
developed. Although the maximum clique problem is known to be NP-hard,
powerful heuristics exist which efficiently find good approximate solutions [Sj. 

In many computer vision problems, relational structures are organized in a 
hierarchical manner i.e., are trees. However, in standard association graph for- 
mulations, the solutions are not constrained to preserve this partial order. Hence, 
the extension of such techniques to tree matching problems is of considerable in- 
terest. We have recently introduced a solution to this problem by providing a 
novel way of deriving an association graph, based on the graph-theoretic notions
of connectivity and the distance matrix [7]. We have proved that in this new
formulation there is a one-to-one correspondence between maximum cliques in 
the derived association graph, and maximum subtree isomorphisms. The frame- 
work has also been extended to handle the matching of trees whose nodes have 
one or more associated attributes, by casting the attributed tree matching prob- 
lem as an equivalent problem of finding a maximum weight clique in a weighted 
association graph. 

Whereas thus far we have focussed on one-to-one correspondences (isomorphisms)
and many-to-one correspondences (homomorphisms) between the structures
being matched [?], it is clear that these may be overly restrictive assumptions
for problems where structural alterations may occur in both of the underlying
trees, i.e., nodes have been deleted or additional nodes are present. In
this paper we provide a generalization of the association graph framework to
handle many-to-many correspondences (morphisms). We define a notion of an
$\varepsilon$-morphism (a many-to-one mapping) between weighted attributed trees, and
provide a method of constructing an association graph where maximal weight
cliques are in one-to-one correspondence with maximal similarity subtree
morphisms. We then solve the problem by using replicator dynamical systems from
evolutionary game theory. We illustrate the approach by matching articulated
and deformed 2D shapes described by shock trees.



2 Many-to-many Tree Matching 

Formally, an attributed tree is a triple $T = (V,E,\alpha)$, where $(V,E)$ is the
“underlying” rooted tree and $\alpha : V \to A$ is a function which assigns an attribute
vector $\alpha(u)$ to each node $u \in V$. Two nodes $u,v \in V$ are said to be adjacent
(denoted $u \sim v$) if they are connected by an edge. We shall also consider a
function $\delta : A \to \mathbb{R}^+$ which assigns to each set of attributes (and therefore to
each node in the tree) a real positive number. This will be interpreted as the
negligibility of the corresponding node in the tree. Specifically, a node will be
declared “negligible” if the value of the function $\delta$ corresponding to its attributes
is smaller than a fixed threshold $\varepsilon$. This will allow us to associate a cluster of
nodes (defined in the following) in the first subtree to a single node in the other
one, thereby defining a many-to-one mapping from the first to the second tree.

For technical simplicity, we shall transform our node-weighted tree into an
edge-weighted derived attributed tree, by simply moving the $\delta$-value associated to
every node to the edge connecting it to its parent. The root of the original tree
will become a child of a newly created dummy root, and its weight will be moved
to the corresponding edge. In the derived tree, we shall speak of a “negligible”
edge when its weight is less than the threshold $\varepsilon$.

Given a fixed threshold $\varepsilon > 0$, we define the distance $d_\varepsilon(u,v)$ between two
nodes $u$ and $v$ in an attributed tree as the number of non-negligible edges (i.e.
with weight not less than $\varepsilon$) on the (unique) path from $u$ to $v$. We define the function
$\mathrm{lev}_\varepsilon(u)$ of a node in an attributed tree as the $\varepsilon$-distance between the root of the
tree and node $u$.

Given $\varepsilon > 0$, we define an $\varepsilon$-cluster in a derived attributed tree $T$ as a subset
of nodes of $T$ such that for every two nodes $u$ and $v$ in it, we have:

$$d_\varepsilon(u,v) = 0 .$$

A set with only one node is a particular case of cluster that we call a singleton.
It can easily be proved that given an $\varepsilon$-cluster $C$, for all $u,v \in C$, and $z \notin C$, we
have:

$$d_\varepsilon(u,z) = d_\varepsilon(v,z) .$$

From this observation, the next proposition follows [?]:

Proposition 1. Given two $\varepsilon$-clusters $C'$ and $C''$, for any pair of nodes, one in
$C'$ and the other in $C''$, their $\varepsilon$-distance is constant.

This allows us to extend the notion of $\varepsilon$-distance to pairs of clusters, which
we shall denote by $d_\varepsilon(C',C'')$, and in turn to generalize the notion of adjacency
to clusters. Specifically, two disjoint $\varepsilon$-clusters $C'$ and $C''$ in a derived attributed
tree are said to be $\varepsilon$-adjacent (denoted $C' \sim C''$) if:

$$d_\varepsilon(C',C'') = 1 .$$

It is clear that when $C'$ and $C''$ are singletons, this is equivalent to the traditional
notion of adjacency between nodes.
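The $\varepsilon$-distance and the $\varepsilon$-cluster test can be sketched on a small derived tree. The representation (a dictionary mapping each node to its parent and the weight of the edge to that parent), the function names and the sample weights are all hypothetical; an edge is counted as non-negligible when its weight is at least $\varepsilon$.

```python
from itertools import combinations

def path_to_root(parent, u):
    """Nodes from u up to the root, following parent links."""
    path = [u]
    while parent[u] is not None:
        u = parent[u][0]
        path.append(u)
    return path

def eps_distance(parent, u, v, eps):
    """Number of non-negligible edges (weight >= eps) on the path from u to v."""
    pu, pv = path_to_root(parent, u), path_to_root(parent, v)
    ancestors_v = set(pv)
    top = next(x for x in pu if x in ancestors_v)   # lowest common ancestor
    count = 0
    for path in (pu, pv):
        for x in path:
            if x == top:
                break
            if parent[x][1] >= eps:                 # edge from x to its parent
                count += 1
    return count

def is_eps_cluster(parent, nodes, eps):
    """True if every pair of the given nodes is at eps-distance zero."""
    return all(eps_distance(parent, u, v, eps) == 0
               for u, v in combinations(nodes, 2))

# Hypothetical derived tree: node -> (parent, weight of the edge to the parent).
parent = {
    "r": None,
    "a": ("r", 0.9),
    "b": ("a", 0.1),
    "c": ("a", 0.8),
    "d": ("b", 0.05),
}
eps = 0.2
```

In this tree the nodes a, b and d form an $\varepsilon$-cluster, and each of them is at $\varepsilon$-distance 1 from c, which illustrates the constancy property stated before Proposition 1.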

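Finally, we are in a position to introduce the notion of a parent-child
relationship between pairs of clusters. Let $C'$ and $C''$ be two disjoint $\varepsilon$-clusters in a
derived attributed tree $T$. We say that $C'$ is an $\varepsilon$-parent of $C''$ when:

$$C' \sim C'' \quad \text{and} \quad \mathrm{lev}_\varepsilon(C') < \mathrm{lev}_\varepsilon(C'')$$

where $\mathrm{lev}_\varepsilon(C')$ is defined as $d_\varepsilon(\mathrm{root}(T), C')$.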

Now, let $T_1 = (V_1,E_1,\alpha_1)$ and $T_2 = (V_2,E_2,\alpha_2)$ be two attributed trees, and
let $M \subseteq V_1 \times V_2$ be any relation on $V_1$ and $V_2$. Define the sets $H_1$ and $H_2$ as
follows:

$$H_1 = \{ u \in V_1 : \exists w \in V_2 \text{ such that } (u,w) \in M \}$$

and

$$H_2 = \{ w \in V_2 : \exists u \in V_1 \text{ such that } (u,w) \in M \} .$$

Moreover, for each $u \in H_1$ let $M[u]$ be defined as:

$$M[u] = \{ w \in V_2 : (u,w) \in M \}$$

and, similarly, for each $w \in H_2$ let $M[w]$ be defined as

$$M[w] = \{ u \in V_1 : (u,w) \in M \} .$$

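A relation $M \subseteq V_1 \times V_2$ is called a subtree $\varepsilon$-morphism if the following
properties hold:

The subgraphs induced by $H_1$ and $H_2$ are connected (i.e. are trees)   (1)

$\forall u \in H_1, \forall w \in H_2$ : $M[u]$ and $M[w]$ are $\varepsilon$-clusters   (2)

$\forall u,v \in H_1$ : $u$ is an $\varepsilon$-parent of $v$ $\Leftrightarrow$ $M[u]$ is an $\varepsilon$-parent of $M[v]$
$\forall w,z \in H_2$ : $w$ is an $\varepsilon$-parent of $z$ $\Leftrightarrow$ $M[w]$ is an $\varepsilon$-parent of $M[z]$   (3)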

Clearly, in realistic applications, it would be desirable to find a morphism
which pairs nodes having “similar” attributes. To this end, let $\sigma$ be any similarity
measure on the attribute space, i.e., any (symmetric) function which assigns a
positive number to any pair of attribute vectors. If $M$ is a subtree $\varepsilon$-morphism
between two attributed trees $T_1 = (V_1,E_1,\alpha_1)$ and $T_2 = (V_2,E_2,\alpha_2)$, the overall
similarity between the matched structures can be defined as follows:

$$S(M) = \sum_{(u,w) \in M} \sigma(\alpha_1(u), \alpha_2(w)) .$$

The $\varepsilon$-morphism $M$ is called a maximal similarity subtree $\varepsilon$-morphism if we
cannot add further matchings to $M$ while retaining the morphism property. It is
called a maximum similarity subtree $\varepsilon$-morphism if $S(M)$ is the largest among
all $\varepsilon$-morphisms between $T_1$ and $T_2$.

3 The Tree Association Graph 

Let $u$ and $v$ be two nodes of an attributed tree $T$, joined by the path
$u = x_0 x_1 \ldots x_n = v$, and let $x^*$ be the node in the path nearest to the root of
$T$; formally: $x^* = \arg\min_{x \in \{x_0,\ldots,x_n\}} \mathrm{lev}(x)$. We define:

$$n_\varepsilon(u,v) = d_\varepsilon(u, x^*)$$

$$p_\varepsilon(u,v) = d_\varepsilon(x^*, v) .$$
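The two quantities just defined split the path between two nodes at its topmost node and count the non-negligible edges on each side. The sketch below assumes a derived tree stored as a dictionary from each node to its parent and the weight of the connecting edge; the representation, the function names and the sample weights are hypothetical.

```python
def path_to_root(parent, u):
    """Nodes from u up to the root, following parent links."""
    path = [u]
    while parent[u] is not None:
        u = parent[u][0]
        path.append(u)
    return path

def ascent_descent(parent, u, v, eps):
    """Return (n_eps, p_eps): non-negligible edges climbed from u up to x*,
    and non-negligible edges descended from x* down to v."""
    pu, pv = path_to_root(parent, u), path_to_root(parent, v)
    ancestors_v = set(pv)
    x_star = next(x for x in pu if x in ancestors_v)   # path node nearest the root

    def count(path):
        total = 0
        for x in path:
            if x == x_star:
                break
            if parent[x][1] >= eps:                    # edge from x to its parent
                total += 1
        return total

    return count(pu), count(pv)

# Hypothetical derived tree: node -> (parent, weight of the edge to the parent).
parent = {"r": None, "a": ("r", 0.9), "b": ("a", 0.1),
          "c": ("a", 0.8), "d": ("b", 0.05)}
n_eps, p_eps = ascent_descent(parent, "d", "c", eps=0.2)
```

Swapping the two arguments swaps the two counts, since an ascent from one end of the path is a descent seen from the other.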

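The weighted $\varepsilon$-tree association graph ($\varepsilon$-TAG) of two attributed trees
$T_1 = (V_1,E_1,\alpha_1)$ and $T_2 = (V_2,E_2,\alpha_2)$ is the graph $G_\varepsilon = (V,E,\omega)$ where

$$V = V_1 \times V_2$$

such that for any two nodes $(u,w)$ and $(v,z)$ in $V$

$$(u,w) \sim (v,z) \;\Leftrightarrow\; \begin{cases} n_\varepsilon(u,v) = n_\varepsilon(w,z) \\ p_\varepsilon(u,v) = p_\varepsilon(w,z) \end{cases} \qquad (4)$$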
and $\omega$ is a function which assigns a positive weight to each node $(u,w) \in V$ as follows:

$$\omega(u,w) = \sigma(\alpha_1(u), \alpha_2(w)) . \qquad (5)$$

Intuitively, condition (4) imposes that the number of negligible ascent edges between $u$ and $v$ must be equal to the number of negligible ascent edges between $w$ and $z$. The same applies to descent edges. Notice that, when $\epsilon = 0$ we obtain the same association graph structure as originally studied for the tree isomorphism case [?].

Given a subset of nodes $C$ of $V$, the total weight assigned to $C$ is simply the sum of all the weights associated with its nodes. A maximal weight clique in $G$ is one which is not contained in any other clique having larger total weight, while a maximum weight clique is a clique having largest total weight. The maximum weight clique problem is to find a maximum weight clique of $G$ [?].

The following result establishes a one-to-one correspondence between the attributed tree morphism problem and the maximum weight clique problem (see [?] for the proof).

Theorem 1. Any maximal (maximum) similarity subtree $\epsilon$-morphism between two attributed trees induces a maximal (maximum) weight clique in the corresponding weighted $\epsilon$-TAG, and vice versa.

Once the tree morphism problem has been formulated as a maximum weight clique problem, any clique finding algorithm can be employed to solve it (see [?] for a recent review). In the work reported in this paper, we used an approach recently introduced in [?], which is summarized in the next section.

4 Matching via Game Dynamics

Let $G = (V,E,\omega)$ be an arbitrary weighted graph of order $n$, and let $S_n$ denote the standard simplex of $\mathbb{R}^n$:

$$S_n = \{ x \in \mathbb{R}^n : e'x = 1 \text{ and } x_i \ge 0, \; i = 1 \ldots n \}$$

where $e$ is the vector whose components equal 1, and a prime denotes transposition. Given a subset of vertices $C$ of $G$, we will denote by $x^C$ its characteristic vector which is the point in $S_n$ defined as

$$x^C_i = \begin{cases} \omega(u_i)/\Omega(C), & \text{if } u_i \in C \\ 0, & \text{otherwise} \end{cases}$$

where $\Omega(C) = \sum_{u_j \in C} \omega(u_j)$ is the total weight on $C$.
Now, consider the following quadratic function

$$f(x) = x'Ax \qquad (6)$$

where $A = (a_{ij})$ is the $n \times n$ symmetric matrix defined as follows:

$$a_{ij} = \begin{cases} \dfrac{1}{2\omega(u_i)} & \text{if } i = j , \\ 0 & \text{if } i \neq j \text{ and } u_i \sim u_j , \\ \dfrac{1}{2\omega(u_i)} + \dfrac{1}{2\omega(u_j)} & \text{otherwise} . \end{cases} \qquad (7)$$

The following theorem allows us to formulate the maximum weight clique problem as a quadratic program, thereby switching from the discrete to the continuous domain (see [?] for proof).

Theorem 2. Let $G = (V,E,\omega)$ be an arbitrary weighted graph and consider a matrix $A \in \mathcal{M}(G)$. Then, the following hold:

(a) A vector $x \in S_n$ is a local minimizer of $f$ on $S_n$ if and only if $x = x^C$, where $C$ is a maximal weight clique of $G$.

(b) A vector $x \in S_n$ is a global minimizer of $f$ on $S_n$ if and only if $x = x^C$, where $C$ is a maximum weight clique of $G$.

Moreover, all local (and hence global) minimizers of $f$ on $S_n$ are strict.
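The link between clique weight and the value of $f$ can be checked numerically. The sketch below assumes the matrix has diagonal entries $1/(2\omega(u_i))$, zeros for adjacent pairs and the sum of the two diagonal terms for non-adjacent pairs, which is how Equation (7) is read here. Under that reading, for a clique $C$ only the diagonal terms survive and $f(x^C) = 1/(2\Omega(C))$, so heavier cliques give smaller values. The graph and the helper names are hypothetical.

```python
def build_A(weights, edges):
    """Matrix read from Equation (7) for node weights and an undirected edge list."""
    n = len(weights)
    adj = {frozenset(e) for e in edges}
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                A[i][j] = 1.0 / (2.0 * weights[i])
            elif frozenset((i, j)) in adj:
                A[i][j] = 0.0
            else:
                A[i][j] = 1.0 / (2.0 * weights[i]) + 1.0 / (2.0 * weights[j])
    return A

def characteristic_vector(weights, clique):
    """x^C: weight of each clique node divided by the total clique weight."""
    total = sum(weights[i] for i in clique)
    return [weights[i] / total if i in clique else 0.0
            for i in range(len(weights))]

def f(A, x):
    """Quadratic form x'Ax."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# Hypothetical graph: triangle {0, 1, 2} with weights 2, plus node 3
# (weight 1) attached to node 0 only.
weights = [2.0, 2.0, 2.0, 1.0]
edges = [(0, 1), (0, 2), (1, 2), (0, 3)]
A = build_A(weights, edges)
f_triangle = f(A, characteristic_vector(weights, {0, 1, 2}))  # clique weight 6
f_edge = f(A, characteristic_vector(weights, {0, 3}))         # clique weight 3
```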

We now turn our attention to a class of simple dynamical systems that we use for solving our quadratic optimization problem. Let $W$ be a non-negative real-valued $n \times n$ matrix, and consider the following dynamical system:

$$\dot{x}_i(t) = x_i(t) \left[ (Wx(t))_i - x(t)'Wx(t) \right] , \quad i = 1 \ldots n \qquad (8)$$

where a dot signifies derivative w.r.t. time $t$, and its discrete-time counterpart

$$x_i(t+1) = x_i(t) \frac{(Wx(t))_i}{x(t)'Wx(t)} , \quad i = 1 \ldots n . \qquad (9)$$

It is readily seen that the simplex $S_n$ is invariant under these dynamics, which means that every trajectory starting in $S_n$ will remain in $S_n$ for all future times. Both (8) and (9) are called replicator equations in evolutionary game theory, since they are used to model evolution over time of relative frequencies of interacting, self-replicating agents [?].

Theorem 3. If $W = W'$ then the function $x(t)'Wx(t)$ is strictly increasing with increasing $t$ along any non-stationary trajectory $x(t)$ under both continuous-time (8) and discrete-time (9) replicator dynamics. Furthermore, any such trajectory converges to a stationary point. Finally, a vector $x \in S_n$ is asymptotically stable under (8) and (9) if and only if $x$ is a strict local maximizer of $x'Wx$ on $S_n$.

The previous result is known in mathematical biology as the fundamental theorem of natural selection [?] and, in its original form, traces back to R. A. Fisher. Motivated by this result, we use (as in [?]) replicator equations as a simple heuristic for solving our attributed tree matching problem. Indeed, note that replicator equations are maximization procedures while ours is a minimization problem. However, it is straightforward to see that the problem of minimizing the quadratic form $x'Hx$ on the simplex is equivalent to that of maximizing $x'(\gamma ee' - H)x$, where $\gamma$ is an arbitrary constant, since $x'ee'x = 1$ on the simplex. Let $T_1 = (V_1,E_1,\alpha_1)$ and $T_2 = (V_2,E_2,\alpha_2)$ be two attributed trees, and let $G = (V,E,\omega)$ be the corresponding $\epsilon$-TAG. By letting

$$W = \gamma ee' - A \qquad (10)$$

where $A$ is defined in (7) and $\gamma = \max a_{ij}$, we know that the replicator dynamical systems (8) and (9), starting from an arbitrary initial state, which is usually taken to be the simplex barycenter, will iteratively maximize the function $x'Wx$ (and hence minimize $x'Ax$) over the simplex and will eventually converge to a strict local optimizer which will then correspond to the characteristic vector of a maximal weight clique in the $\epsilon$-TAG. This will in turn induce a maximal similarity subtree $\epsilon$-morphism between $T_1$ and $T_2$.
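The procedure described in this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the standard discrete-time replicator update $x_i \leftarrow x_i (Wx)_i / (x'Wx)$, builds $A$ with diagonal entries $1/(2\omega(u_i))$, zeros for adjacent pairs and the sum of the two diagonal terms otherwise, sets $W = \gamma ee' - A$ with $\gamma = \max a_{ij}$, and starts from the simplex barycenter. The function name, the fixed iteration count, the support threshold and the toy graph are all hypothetical choices.

```python
def replicator_clique(weights, edges, iterations=1000):
    """Run discrete-time replicator dynamics on W = gamma*ee' - A."""
    n = len(weights)
    adj = {frozenset(e) for e in edges}
    half = [1.0 / (2.0 * w) for w in weights]
    # Matrix A: diagonal 1/(2w_i), 0 for adjacent pairs, sum of halves otherwise.
    A = [[half[i] if i == j else
          0.0 if frozenset((i, j)) in adj else
          half[i] + half[j]
          for j in range(n)] for i in range(n)]
    gamma = max(max(row) for row in A)
    W = [[gamma - A[i][j] for j in range(n)] for i in range(n)]
    x = [1.0 / n] * n                                  # simplex barycenter
    for _ in range(iterations):
        Wx = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        xWx = sum(x[i] * Wx[i] for i in range(n))
        x = [x[i] * Wx[i] / xWx for i in range(n)]     # replicator update
    return x

# Hypothetical graph: triangle {0, 1, 2} with weights 2, plus node 3
# (weight 1) attached to node 0 only.
weights = [2.0, 2.0, 2.0, 1.0]
edges = [(0, 1), (0, 2), (1, 2), (0, 3)]
x = replicator_clique(weights, edges)
clique = {i for i, xi in enumerate(x) if xi > 1e-6}   # support of the limit point
```

On this toy graph the trajectory settles on the triangle, whose characteristic vector has entries 1/3 on the three clique nodes. In general the dynamics only guarantee a maximal, not necessarily maximum, weight clique.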

5 An Example: Matching Shock Trees 

We illustrate our framework with numerical examples of shape matching. We use a shock graph representation based on a coloring of the shocks (singularities) of a curve evolution process acting on simple closed curves in the plane [?]. Results on shape matching using edit operations between trees obtained from a related representation have also been recently reported in [?].

Shocks are grouped into distinct types according to the local variation of the radius function along the medial axis. Intuitively, the radius function varies monotonically at a type 1, reaches a strict local minimum at a type 2, is constant at a type 3 and reaches a strict local maximum at a type 4. The shocks comprise vertices in the graph, and their formation times direct edges to form a basis for tree matching. Each graph can be reduced to a unique attributed rooted tree, providing the requisite hierarchical structure for our matching algorithm. An illustrative example appears in Figure 1. The vector of attributes assigned to each node $u \in V$ of the attributed shock tree $T = (V,E,\alpha)$ is given by $\alpha(u) = (x_1,y_1,r_1,v_1,\theta_1 ; \ldots ; x_m,y_m,r_m,v_m,\theta_m)$. Here $m$ is the number of shocks in the group, and $x_i, y_i, r_i, v_i, \theta_i$ are, respectively, the x coordinate, the y coordinate, the radius (or time of formation), the speed, and the direction of each shock $i$ in the sequence, obtained as outputs of the shock detection process.

In order to apply our framework we must define a measure of significance for each node (related to the $\epsilon$ parameter) as well as a similarity measure between the attributes of two nodes $u$ and $v$. The significance measure used here is the amount of boundary reconstructed by the shocks in a particular node, expressed as a fraction of the total boundary length. Intuitively, nodes are deemed less significant when they correspond to ligature-like regions and their associated shocks reconstruct only small portions of the boundary [?]. The similarity measure used is the same as the one described in [?], but scaled by the average significance of the two nodes being matched. The measure provides a number between 0 and 1, which represents the overall similarity between the geometric attributes of the two nodes being compared. The measure is designed to be invariant under rotations and translations of two shapes, and to satisfy the requirements of the weight function discussed above.

Fig. 1. Left: Shock groups are identified on the skeleton using the notation identifier-shocktype. Right: The corresponding shock tree [?].

We selected 24 silhouettes representing seven different object classes; the tool shapes were taken from the Rutgers Tools database. For each object class one or more prototype objects were chosen and were matched against all entries in the database using the clique-finding replicator equations described in the previous section.

The selection of a proper value for the parameter $\epsilon$ is clearly crucial for the performance of the matching process. Our intuition is that the larger the value of $\epsilon$, the larger the cluster formed by the matching process. In the experiments presented here $\epsilon$ was set by trial and error to be equal to 0.004.

We ranked the results using a score given by the quantity:

$$W \times \frac{1}{2} \left\{ \sum_{u \in H_1} \frac{l(u)}{L_1} + \sum_{v \in H_2} \frac{l(v)}{L_2} \right\} , \qquad (11)$$

where $H_1$ and $H_2$ are the sets of matched nodes in the first and in the second tree, respectively, $W$ is the overall similarity between matched nodes (i.e., the weight of the maximal clique found), $L_1$ and $L_2$ the total boundary lengths of each shape and $l(u), l(v)$ the lengths of boundaries reconstructed by nodes $u$ and $v$, respectively. The score represents the weight of the maximal clique scaled by the average of the total (relative) length reconstructed by the nodes in each tree that participate in the match. The top 8 matches are shown for each query shape, in Table 1. It is evident that the best matches are typically to instances from the same object class. These results represent an improvement with respect to those obtained for the isomorphism case [?], and a slight improvement over those obtained for homomorphism on the same database [?]. Specifically, the many-to-many matching algorithm tends to provide a sharper distinction between object classes.

Table 1. A tabulation of the top 8 matches for several prototype shapes, with $\epsilon$ = 0.004. The scores indicate the value of index (11). (The shape thumbnails of the original table are not reproduced; each row lists the scores of the top 8 matches for one prototype, in the original row order.)

Prot.      1      2      3      4      5      6      7      8
  1      1.000  0.764  0.711  0.615  0.499  0.459  0.430  0.421
  2      1.000  0.711  0.668  0.607  0.506  0.428  0.423  0.421
  3      1.000  0.788  0.784  0.689  0.643  0.622  0.528  0.474
  4      1.000  0.835  0.784  0.722  0.577  0.513  0.496  0.489
  5      1.000  0.819  0.749  0.652  0.593  0.496  0.474  0.464
  6      1.000  0.762  0.749  0.681  0.665  0.447  0.429  0.423
  7      1.000  0.819  0.762  0.691  0.670  0.458  0.449  0.433
  8      1.000  0.508  0.430  0.402  0.388  0.354  0.295  0.295
  9      1.000  0.722  0.715  0.643  0.545  0.492  0.407  0.400
 10      1.000  0.904  0.888  0.427  0.343  0.330  0.329  0.321
 11      1.000  0.888  0.843  0.435  0.347  0.325  0.311  0.310
 12      1.000  0.535  0.503  0.446  0.435  0.428  0.427  0.395
 13      1.000  0.780  0.535  0.399  0.300  0.298  0.293  0.243

However, more experiments have to be done to assess the overall impact of the many-to-many matching algorithm, particularly when the database of shapes is much larger. We are currently carrying out such experiments, using a database of shock trees built using a novel algorithm for computing skeletal graphs [?].

6 Conclusions 

We have expanded our earlier work on matching hierarchical relational structures to handle many-to-many mappings. This has been done by introducing a notion of $\epsilon$-morphism between attributed trees, and by constructing an association graph whose maximal weight cliques are shown to be in one-to-one correspondence with maximal similarity subtree morphisms. Computational examples of matching articulated and deformed shapes give improved results over our earlier work [?], because the framework now allows for negligible nodes in either one of the two trees to be ignored during the matching process.
Abstract. This paper casts the problem of perceptual grouping into an
evidence combining setting using the apparatus of the EM algorithm. We 
are concerned with recovering a perceptual arrangement graph for line- 
segments using evidence provided by a raw perceptual grouping field. 
The perceptual grouping process is posed as one of pairwise relational 
clustering. The task is to assign line-segments (or other image tokens) to 
clusters in which there is strong relational affinity between token pairs. 
The parameters of our model are the cluster memberships and the pair- 
wise affinities or link-weights for the nodes of a perceptual relation graph. 
Commencing from a simple probability distribution for these parame- 
ters, we show how they may be estimated using the apparatus of the EM 
algorithm. The new method is demonstrated on line-segment grouping 
problems where it is shown to outperform a non-iterative eigenclustering 
method. 



1 Introduction 

Perceptual grouping is an important process which permits low-level features to 
be organised into higher level relational structures that can be subsequently used 
for scene understanding and object recognition. Broadly speaking, the available 
literature on perceptual grouping can be divided into three areas according to 
the level of abstraction at which they operate. At the lowest level of abstraction 
the available algorithms are concerned with computation of the grouping field. 
There are several contributions which deserve special mention. Heitger and von der Heydt [4] have shown how to model the line extension field using directional filters whose shapes are motivated by studies of the visual field of monkeys. Williams and his co-workers [?] have taken a different approach using the stochastic completion field. Here the completion field of curvilinear features is computed using Monte Carlo simulation of particle trajectories between the end-points of contours. At the intermediate level of abstraction, several authors have investigated the use of iterative relaxation style operators for edgel grouping. This approach was pioneered by Shashua and Ullman [12] and later refined by Guy and Medioni [3] among others. Parent and Zucker have shown how co-circularity can be used to gauge the compatibility of neighbouring edges [8]. Matas and Kittler [6] have shown how Waltz's dictionary of line-label configurations can be used for junction re-enforcement. At the highest level of abstraction, the available algorithms pose the grouping problem as that of recovering a graph which represents the relational arrangement of segmental entities previously extracted from raw image data. One of the most popular methods here is to use ideas from spectral graph theory to locate the salient relational structure. For instance, both Sarkar and Boyer [?] and Perona and Freeman [10] have used the eigenstructure of a perceptual affinity matrix to find disjoint subgraphs that represent the main arrangements of segmental entities. Finally, it is worth mentioning that several authors have used similar algorithms based on eigenvalues of an affinity matrix to iteratively segment image data. One of the best known is the normalised cut method of Shi and Malik [13]. Recently, Weiss [?] has shown how this and other closely related methods can be improved using a normalised affinity matrix.

The observation underpinning this paper is that although considerable effort 
has been expended at intermediate level to develop algorithms for combining 
evidence from a raw grouping field, the higher-level graph-based methods use 
static affinity relationships or relational abstractions as input. The aim in this 
paper is to develop a different approach to the problem which poses the recovery 
of the perceptual arrangement graph in an evidence combining framework. We 
pose the problem as one of pairwise clustering which is parameterised using two 
sets of indicator variables. The first of these are cluster membership variables 
which indicate to which perceptual cluster a segmental entity belongs. The sec- 
ond set of variables are link weights which convey the strength of the perceptual 
relations between pairs of nodes in the same cluster. Our contribution is to show 
how to estimate both sets of indicator variables using the apparatus of the EM 
algorithm. 

The outline of this paper is as follows. In Section 2 we develop a mixture 
model for the grouping problem. Section 3 shows how the parameters of this 
mixture model, namely the cluster membership probabilities and the pairwise 
link-weights can be estimated using the EM algorithm. In Section 4 we describe 
a simple model which can be used to initialise the link-weights. Section 5 pro- 
vides a sensitivity study on synthetic data and also furnishes some examples on 
real world images. Finally, Section 6 concludes the paper by summarising our 
contributions and offering directions for future research. 



2 Maximum Likelihood Framework 

We pose the problem of perceptual grouping as that of finding the pairwise clusters which exist within a set of objects segmented from raw image data. These objects may be point-features such as corners, lines, curves or regions. However, in this paper we focus on the problem of grouping line-segments. The process of pairwise clustering is somewhat different to the more familiar one of central clustering. Whereas central clustering aims to characterise cluster-membership using the cluster mean and variance, in pairwise clustering it is link-weights between nodes which are used to establish cluster membership. Although less well studied than central clustering, there has recently been renewed interest in pairwise clustering aimed at placing the method on a more principled footing using techniques such as mean-field annealing [5].

We abstract the problem in the following way. The raw perceptual entities are indexed using the set $V$. Our aim is to assign each node to one of a set of pairwise clusters which are indexed by the set $\Omega$. To represent the state of organisation of the perceptual relation graph, we introduce some indicator variables. First, we introduce a cluster membership indicator which is unity if the node $i$ belongs to the perceptual cluster $\omega \in \Omega$ and is zero otherwise, i.e.

$$s_{i\omega} = \begin{cases} 1 & \text{if } i \in \omega \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$



The second model ingredient is the link-weight $A_{ij}$ between pairs of nodes $(i,j) \in V \times V - \{(i,i) \,|\, i \in V\}$. When the link-weights become binary in nature, they convey the following meaning

$$A_{ij} = \begin{cases} 1 & \text{if } i \in \omega \text{ and } j \in \omega \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

When the link-weights satisfy the above condition, then the different clusters represent disjoint subgraphs.

Our aim is to find the cluster membership variables and the link weights which partition the set of raw perceptual entities into disjoint pairwise clusters. We commence by assuming that there are putative edges between each pair of nodes $(i,j)$ belonging to the Cartesian self-product $\Phi = V \times V - \{(i,i) \,|\, i \in V\}$. Further suppose that $p(A_{ij})$ is the probability density for the link weight appearing on the pair of nodes $(i,j) \in \Phi$. Our aim is to locate disjoint subgraphs by updating the link weights until they are either zero or unity. Under the assumption that the link-weights on different pairs of nodes are independent of one another, the likelihood function for the observed arrangement of perceptual entities can be factorised over the set of putative edges as

$$P(G) = \prod_{(i,j) \in \Phi} p(A_{ij}) \qquad (3)$$

We are interested in partitioning the set of perceptual entities into pairwise clusters using the link weights between them. We must therefore entertain the possibility that each of the Cartesian pairs appearing under the above product, which represent putative perceptual relations, may belong to each of the pairwise clusters indexed by the set $\Omega$. To make this uncertainty of association explicit, we construct a mixture model over the perceptual clusters and write

$$p(A_{ij}) = \sum_{\omega \in \Omega} p(A_{ij}|\omega) P(\omega) \qquad (4)$$

According to this mixture model, $p(A_{ij}|\omega)$ is the probability that the nodes $i$ and $j$ are connected by an edge with link weight $A_{ij}$ which falls within the perceptual cluster indexed $\omega$. The total probability mass associated with the cluster indexed $\omega$ is $P(\omega)$. In most of our experiments, we will assume that there are only two such sets of nodes: those that represent a foreground arrangement, and those that represent background clutter. However, for generality we proceed under the assumption that there are an arbitrary number of perceptual clusters. As a result, the probability of the observed set of perceptual entities is

$$P(G) = \prod_{(i,j) \in \Phi} \sum_{\omega \in \Omega} p(A_{ij}|\omega) P(\omega) \qquad (5)$$

To proceed, we require a model of the probability distribution for the link-weights. Here we adopt the Bernoulli model

$$p(A_{ij}|\omega) = A_{ij}^{s_{i\omega} s_{j\omega}} (1 - A_{ij})^{1 - s_{i\omega} s_{j\omega}} \qquad (6)$$

This distribution takes on its largest values when either the link weight $A_{ij}$ is unity and $s_{i\omega} = s_{j\omega} = 1$, or if the link weight $A_{ij} = 0$ and $s_{i\omega} = s_{j\omega} = 0$.
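A few numerical values may help to see how the Bernoulli link model behaves. The sketch below assumes the density has the form $A_{ij}^{s_{i\omega}s_{j\omega}}(1-A_{ij})^{1-s_{i\omega}s_{j\omega}}$, which is the reading of Equation (6) adopted here; the function name `link_density` and the sample values are illustrative.

```python
def link_density(A_ij, s_i, s_j):
    """Bernoulli link model: A_ij**(s_i*s_j) * (1 - A_ij)**(1 - s_i*s_j)."""
    both = s_i * s_j
    return (A_ij ** both) * ((1.0 - A_ij) ** (1.0 - both))

same_cluster_strong = link_density(0.9, 1, 1)   # both nodes in the cluster, strong link
same_cluster_weak = link_density(0.1, 1, 1)     # both nodes in the cluster, weak link
split_weak = link_density(0.1, 1, 0)            # only one node in the cluster, weak link
binary_agree = link_density(1.0, 1, 1)          # binary link agreeing with memberships
```

A strong link is well explained when both endpoints belong to the cluster, while a weak link is well explained when they do not share it, which is the behaviour described in the text.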

3 Expectation-Maximisation 

Our aim is to find the cluster-membership weights and the link-weights which maximize the likelihood function appearing in Equation (5). One way to locate the maximum likelihood perceptual relation graph is to update the binary cluster and edge indicators. This could be effected using a number of optimisation methods including simulated annealing and Markov Chain Monte Carlo. However, here we use the apparatus of the EM algorithm originally developed by Dempster, Laird and Rubin [1]. Our reason for doing this is that the cluster-membership variables must be regarded as hidden data whose distribution is governed by the link weights $A_{ij}$. Since at the outset we know neither the associations between nodes and clusters nor the strength of the link weights within clusters, this information must be treated as hidden data. In other words, we must use the EM algorithm to estimate them.

The idea underpinning the EM algorithm is to recover maximum likelihood solutions to problems involving missing or hidden data by iterating between two computational steps. In the E (or expectation) step we estimate the a posteriori probabilities of the hidden data using maximum likelihood parameters recovered in the preceding maximisation (M) step. The M-step in turn aims to recover the parameters which maximise the expected value of the log-likelihood function. It is the a posteriori probabilities available from the E-step which allow the weighting of the log-likelihood required in the maximisation step.
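3.1 Expected Log-Likelihood Function

For the likelihood function appearing in Equation (5), the expected log-likelihood function is defined to be

$$Q(G^{(n+1)}|G^{(n)}) = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi} P(\omega|A^{(n)}_{ij}) \ln p(A^{(n+1)}_{ij}|\omega) \qquad (7)$$

where $p(A^{(n+1)}_{ij}|\omega)$ is the probability distribution for the link-weights at iteration $n+1$ and $P(\omega|A^{(n)}_{ij})$ is the a posteriori probability that the pair of nodes with link weight $A^{(n)}_{ij}$ belong to the cluster indexed $\omega$ at iteration $n$ of the algorithm. When the probability distribution function from Equation (6) is substituted, then the expected log-likelihood function becomes

$$Q(G^{(n+1)}|G^{(n)}) = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi} \zeta^{(n)}_{i,j,\omega} \left\{ s^{(n+1)}_{i\omega} s^{(n+1)}_{j\omega} \ln A^{(n+1)}_{ij} + \left(1 - s^{(n+1)}_{i\omega} s^{(n+1)}_{j\omega}\right) \ln\left(1 - A^{(n+1)}_{ij}\right) \right\} \qquad (8)$$

where we have used the shorthand $\zeta^{(n)}_{i,j,\omega} = P(\omega|A^{(n)}_{ij})$ for the a posteriori cluster membership probabilities. After some algebra to collect terms, the expected log-likelihood function simplifies to

$$Q(G^{(n+1)}|G^{(n)}) = \sum_{\omega \in \Omega} \sum_{(i,j) \in \Phi} \zeta^{(n)}_{i,j,\omega} \left\{ s^{(n+1)}_{i\omega} s^{(n+1)}_{j\omega} \ln \frac{A^{(n+1)}_{ij}}{1 - A^{(n+1)}_{ij}} + \ln\left(1 - A^{(n+1)}_{ij}\right) \right\} \qquad (9)$$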

3.2 Maximisation 

In the maximisation step of the algorithm, we aim to recover the cluster and edge parameters $s_{i\omega}$ and $A_{ij}$. The edge parameters are found by computing the derivatives of the expected log-likelihood function. As a result the updated link-weights are given by

$$A^{(n+1)}_{ij} = \frac{\sum_{\omega \in \Omega} \zeta^{(n)}_{i,j,\omega} \, s^{(n)}_{i\omega} s^{(n)}_{j\omega}}{\sum_{\omega \in \Omega} \zeta^{(n)}_{i,j,\omega}} \qquad (10)$$

In other words, the link-weight for the pair of nodes $(i,j)$ is simply the average of the product of individual node cluster memberships over the different perceptual clusters.
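The link-weight update can be illustrated with a short sketch. It assumes the update is the posterior-weighted average of the membership products, i.e. $\sum_\omega \zeta_{i,j,\omega} s_{i\omega} s_{j\omega} / \sum_\omega \zeta_{i,j,\omega}$, where $\zeta_{i,j,\omega}$ denotes the a posteriori probability of cluster $\omega$ for the pair $(i,j)$. The function name and the two-cluster numbers are hypothetical.

```python
def update_link_weight(zeta_ij, s_i, s_j):
    """Posterior-weighted average of s_i[w] * s_j[w] over the clusters w."""
    num = sum(z * a * b for z, a, b in zip(zeta_ij, s_i, s_j))
    den = sum(zeta_ij)
    return num / den

# Hypothetical two-cluster case (foreground, background).
zeta_ij = [0.8, 0.2]   # a posteriori probabilities for the pair (i, j)
s_i = [0.9, 0.1]       # soft memberships of node i
s_j = [0.7, 0.3]       # soft memberships of node j
A_new = update_link_weight(zeta_ij, s_i, s_j)
```

With these numbers the numerator is 0.8·0.63 + 0.2·0.03 = 0.51 and the denominator is 1, so the updated link weight is 0.51.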

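We use the soft-assign ansatz of Bridle [2] to update the cluster membership assignment variables. This involves exponentiating the partial derivatives of the expected log-likelihood function in the following manner

$$s^{(n+1)}_{i\omega} = \frac{\exp\left[ \dfrac{\partial Q(G^{(n+1)}|G^{(n)})}{\partial s_{i\omega}} \right]}{\sum_{\omega' \in \Omega} \exp\left[ \dfrac{\partial Q(G^{(n+1)}|G^{(n)})}{\partial s_{i\omega'}} \right]} \qquad (11)$$

As a result the update equation for the cluster membership indicator variables is

$$s^{(n+1)}_{i\omega} = \frac{\exp\left[ \sum_{j \in V} \zeta^{(n)}_{i,j,\omega} s^{(n)}_{j\omega} \ln \dfrac{A^{(n+1)}_{ij}}{1 - A^{(n+1)}_{ij}} \right]}{\sum_{\omega' \in \Omega} \exp\left[ \sum_{j \in V} \zeta^{(n)}_{i,j,\omega'} s^{(n)}_{j\omega'} \ln \dfrac{A^{(n+1)}_{ij}}{1 - A^{(n+1)}_{ij}} \right]} = \frac{\prod_{j \in V} \left\{ \dfrac{A^{(n+1)}_{ij}}{1 - A^{(n+1)}_{ij}} \right\}^{\zeta^{(n)}_{i,j,\omega} s^{(n)}_{j\omega}}}{\sum_{\omega' \in \Omega} \prod_{j \in V} \left\{ \dfrac{A^{(n+1)}_{ij}}{1 - A^{(n+1)}_{ij}} \right\}^{\zeta^{(n)}_{i,j,\omega'} s^{(n)}_{j\omega'}}} \qquad (12)$$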
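The soft-assign update amounts to a softmax over clusters of a weighted sum of log-odds of the link weights. The following sketch assumes that form: for node $i$ and cluster $\omega$ the score is $\sum_j \zeta_{i,j,\omega} s_{j\omega} \ln[A_{ij}/(1-A_{ij})]$, and the new memberships are the normalised exponentials of the scores. The function name, the data layout (nested lists) and the toy values are hypothetical, and link weights are assumed to lie strictly between 0 and 1 so that the logarithm is finite.

```python
import math

def softassign_update(i, A, s, zeta):
    """Return the updated memberships of node i over all clusters."""
    n, n_clusters = len(A), len(s[i])
    scores = []
    for w in range(n_clusters):
        total = 0.0
        for j in range(n):
            if j == i:
                continue                   # no self-links
            log_odds = math.log(A[i][j] / (1.0 - A[i][j]))
            total += zeta[i][j][w] * s[j][w] * log_odds
        scores.append(total)
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical example: three nodes, two clusters, uniform posteriors.
A = [[0.0, 0.9, 0.2],
     [0.9, 0.0, 0.2],
     [0.2, 0.2, 0.0]]
s = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]]
zeta = [[[0.5, 0.5] for _ in range(3)] for _ in range(3)]
s_new = softassign_update(0, A, s, zeta)
```

Node 0 is strongly linked to node 1, which mostly belongs to the first cluster, so its own membership shifts towards that cluster.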



3.3 Expectation 



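The a posteriori probabilities are updated in the expectation step of the algorithm. The current estimates of the parameters $s^{(n)}_{i\omega}$ and $A^{(n)}_{ij}$ are used to compute the probability densities $p(A^{(n)}_{ij}|\omega)$ and the a posteriori probabilities are updated using the formula

$$P(\omega|A^{(n)}_{ij}) = \frac{p(A^{(n)}_{ij}|\omega)\, \alpha^{(n)}(\omega)}{\sum_{\omega' \in \Omega} p(A^{(n)}_{ij}|\omega')\, \alpha^{(n)}(\omega')} \qquad (13)$$

where $\alpha^{(n)}(\omega)$ is the available estimate of the class-prior $P(\omega)$. This is computed using the formula

$$\alpha^{(n)}(\omega) = \frac{1}{|\Phi|} \sum_{(i,j) \in \Phi} P(\omega|A^{(n)}_{ij}) \qquad (14)$$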
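The expectation step can be sketched as an application of Bayes' rule followed by an averaging step. The code below is an illustration under assumptions: it uses the Bernoulli link model $A^{s_i s_j}(1-A)^{1-s_i s_j}$ for the densities, and it normalises the class-prior by the number of ordered node pairs, which is one reading of the normalising constant in the prior update. Function and variable names are hypothetical.

```python
def e_step(A, s, alpha):
    """Return (zeta, alpha_new) for link weights A, memberships s, priors alpha."""
    n, n_clusters = len(A), len(alpha)
    zeta = [[None] * n for _ in range(n)]
    sums = [0.0] * n_clusters
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                    # pairs (i, i) are excluded
            dens = []
            for w in range(n_clusters):
                both = s[i][w] * s[j][w]
                dens.append(A[i][j] ** both * (1.0 - A[i][j]) ** (1.0 - both))
            norm = sum(d * a for d, a in zip(dens, alpha))
            post = [d * a / norm for d, a in zip(dens, alpha)]   # Bayes' rule
            zeta[i][j] = post
            sums = [t + p for t, p in zip(sums, post)]
            pairs += 1
    alpha_new = [t / pairs for t in sums]   # average posterior per cluster
    return zeta, alpha_new

# Hypothetical example: nodes 0 and 1 in cluster 0, node 2 in cluster 1.
A = [[0.0, 0.9, 0.2],
     [0.9, 0.0, 0.2],
     [0.2, 0.2, 0.0]]
s = [[1, 0], [1, 0], [0, 1]]
zeta, alpha_new = e_step(A, s, [0.5, 0.5])
```

For the strongly linked pair (0, 1) the posterior favours the shared cluster, while pairs that straddle the two clusters remain undecided at 0.5 each.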



4 Initial Line-Grouping Field 

We are interested in locating groups of line-segments that exhibit strong geometric affinity to one another. In this section we provide details of a probabilistic linking field that can be used to gauge geometric affinity. To be more formal, suppose we have a set of line-segments $\mathcal{L} = \{\Lambda_i ; i = 1, \ldots, n\}$. Consider two lines $\Lambda_i$ and $\Lambda_j$ drawn from this set. Their respective lengths are $l_i$ and $l_j$. Our model of the linking process commences by constructing the line $\Gamma_{ij}$ which connects the closest pair of endpoints for the two lines. The geometry of this connecting line is represented using the polar angle $\theta_{ij}$ of the line $\Gamma_{ij}$ with respect to the base-line $\Lambda_i$ and its length $\rho_{ij}$. We measure the overall scale of the arrangement of lines using the length of the shorter line $\hat{\rho}_{ij} = \min[l_i, l_j]$. The relative length of the gap between the two line-segments is represented in a scale-invariant manner using the dimensionless quantity $\xi_{ij} = \rho_{ij}/\hat{\rho}_{ij}$.

Fig. 1. (a) Geometric meaning of the parameters used to obtain $P_{ij}$; (b) Plot showing the level curves; (c) 3D plot showing $A_{ij}$ on the z axis

Following Heitger and von der Heydt [4] we model the linking process using an elongated polar grouping field. To establish the degree of geometric affinity between the lines we interpolate the end-points of the two lines using the polar lemniscate $\xi_{ij} = k \cos^2 \theta_{ij}$.

The value of the constant $k$ is used to measure the degree of affinity between the two lines. For each linking line, we compute the value of the constant $k$ which allows the polar locus to pass through the pair of endpoints. The value of this constant is

$$k = \frac{\rho_{ij}}{\hat{\rho}_{ij} \cos^2 \theta_{ij}} \qquad (15)$$

The geometry of the lines and their relationship to the interpolating polar lemniscate is illustrated in Figure 1a. It is important to note that the polar angle is defined over the interval $\theta_{ij} \in (-\pi/2, \pi/2]$ and is rotation invariant.

We use the parameter $k$ to model the linking probability for the pair of line-segments. When the lemniscate envelope is large, i.e. $k$ is large, then the grouping probability is small. On the other hand, when the envelope is compact, then the grouping probability is large. To model this behaviour, we assign the linking probability using the exponential distribution

$$P_{ij} = \exp[-\lambda k] \qquad (16)$$

where $\lambda$ is a constant whose best value has been found empirically to be unity. As a result, the linking probability is large when either the relative separation of the endpoints is small, i.e. $\rho_{ij} \ll \hat{\rho}_{ij}$, or the polar angle is close to zero or $\pi$, i.e. the two lines are collinear or parallel. The linking probability is small when either the relative separation of the endpoints is large, i.e. $\rho_{ij} \gg \hat{\rho}_{ij}$, or the polar angle is close to $\pi/2$, i.e. the two lines are perpendicular. In Figures 1b and c we show a plot of the linking probability as a function of $\xi_{ij}$ and $\theta_{ij}$.
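The initial linking probability is easy to evaluate once the gap length, the two segment lengths and the polar angle are known. The sketch below assumes the lemniscate constant is the gap length divided by the shorter segment length and by $\cos^2\theta$, and that the probability is $\exp[-\lambda k]$ with $\lambda = 1$ by default. The function name, argument names and sample geometry are hypothetical.

```python
import math

def linking_probability(gap, l_i, l_j, theta, lam=1.0):
    """Return exp(-lam * k) with k = (gap / min(l_i, l_j)) / cos(theta)**2."""
    c = math.cos(theta)
    if abs(c) < 1e-12:            # perpendicular: the lemniscate constant diverges
        return 0.0
    k = (gap / min(l_i, l_j)) / (c * c)
    return math.exp(-lam * k)

# Hypothetical segment pairs with lengths 10 and 20.
collinear_close = linking_probability(gap=1.0, l_i=10.0, l_j=20.0, theta=0.0)
collinear_far = linking_probability(gap=30.0, l_i=10.0, l_j=20.0, theta=0.0)
oblique_close = linking_probability(gap=1.0, l_i=10.0, l_j=20.0, theta=math.pi / 3)
```

A small gap between nearly collinear segments gives a probability close to one, and the value drops as the gap grows or as the connecting line turns away from the base-line.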



5 Experiments 

In this section we provide some experiments to illustrate the utility of our new perceptual grouping method. There are two aspects to this study. We commence by providing some examples for synthetic images. Here we investigate the sensitivity of the method to clutter and compare it with an eigendecomposition method. The second aspect of our study focusses on real world images.

The first sequence of synthetic images is shown in Figure 2. Here the foreground structure is an approximately circular arrangement of line-segments. In the first row of Figure 2 we show the arrangement of lines with increasing amounts of added clutter. In the subsequent rows we show the results of grouping. In each row the first image is the pattern of foreground line segments extracted by applying the eigendecomposition method described in [7] to the grouping field already detailed in Section 4. The second, third and fourth images show the results obtained with the first, second and third iterations of the EM algorithm. Here the line segments are coded according to the value of their cluster membership weights for the foreground cluster. As the EM algorithm iterates, the foreground cluster weights tend to unity. In each case, the foreground cluster located by the EM algorithm contains less noise contamination than the result delivered by eigendecomposition. Moreover, none of the line segments leaks into the background.

We have repeated the experiments described above for a sequence of synthetic images in which the density of distractors increases. For each image in turn we have computed the number of distractors merged with the foreground pattern and the number of foreground line-segments which leak into the background. Figures [?]a and b respectively show the fraction of nodes merged with the foreground and the number of nodes which leak into the background as a function of the number of distractors. In both cases, the shoulder of the response curve for the EM algorithm occurs at a significantly higher error rate than that for the eigendecomposition method.

Finally, we present results on real-world images. We have used the airplane and turtle images shown in Figures [?]a and e. The edges shown in Figures [?]b and f have been extracted from the raw images using the Canny edge-detector. Straight-line segments have been extracted using the method of Yin [?]. The resulting clusters obtained with the EM algorithm can be seen in Figures [?]c and g. For comparison, in Figures [?]d and h we show the results obtained with the eigendecomposition method. In both cases the grouping obtained by the EM algorithm is cleaner and contains less spurious clutter.
Fig. 2. Top row: test patterns with 180, 200 and 250 random lines added respectively; second row: result of applying eigendecomposition to the patterns in the top row; third, fourth and fifth rows: each row shows the first, second and third iterations of the EM approach to one of the patterns in the top row.
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6 Conclusions 

In this paper, we have presented a new perceptual clustering algorithm which 
uses the EM algorithm to estimate link-weights and cluster membership prob- 
abilities. The method is based on a mixture model over pairwise clusters. The 
cluster membership probabilities are modelled using a Bernoulli distribution for 
the link-weights. We apply the method to the problem of line-segment grouping. 
Here the method appears robust to severe levels of background clutter. It is also 
relatively insensitive to the gap length and opening angles of the line-segments. 

There are a number of ways in which the method proposed in the paper 
can be improved. Presently, we use a soft-assign method to update the cluster 
membership variables. We are currently investigating whether this step can be 
rendered more efficient using a matrix factorisation method along the lines 
suggested by Perona and Freeman [10]. 

References 

1. A. Dempster, N. Laird, and D. Rubin. Maximum-likelihood from incomplete data 
via the EM algorithm. J. Royal Statistical Soc. Ser. B (methodological), 39:1-38, 
1977. 

2. J. S. Bridle. Training stochastic model recognition algorithms can lead to maximum 
mutual information estimation of parameters. In NIPS 2, pages 211-217, 1990. 

3. G. Guy and G. Medioni. Inferring global perceptual contours from local features. 
International Journal of Computer Vision, 20(1/2):113-133, 1996. 

4. F. Heitger and R. von der Heydt. A computational model of neural contour pro- 
cessing. In IEEE CVPR, pages 32-40, 1993. 

5. T. Hofmann and M. Buhmann. Pairwise data clustering by deterministic annealing. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997. 

6. J. Matas and J. Kittler. Junction detection using probabilistic relaxation. Image 
and Vision Computing, 11(4):197-202, 1993. 

7. A. Robles-Kelly and E. R. Hancock. Grouping line-segments using eigenclustering. 
In Proceedings of the British Machine Vision Conference, 2000. 

8. P. Parent and S. Zucker. Trace inference, curvature consistency and curve detec- 
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):823- 
839, 1989. 

9. P.-Y. Yin. Algorithms for straight line fitting using k-means. Pattern Recognition 
Letters, 19:31-41, 1998. 

10. P. Perona and W. T. Freeman. Factorization approach to grouping. In ECCV, 
pages 655-670, 1998. 

11. S. Sarkar and K. L. Boyer. Quantitative measures of change based on feature 
organization: Eigenvalues and eigenvectors. Computer Vision and Image Under- 
standing, 71(1):110-136, 1998. 

12. A. Shashua and S. Ullman. Structural saliency: The detection of globally salient 
structures using a locally connected network. In Proc. 2nd Int. Conf. in Comp. 
Vision, pages 321-327, 1988. 

13. J. Shi and J. Malik. Normalized cuts and image segmentation. In CVPR, pages 
731-737, 1997. 



14. Y. Weiss. Segmentation using eigenvectors: A unifying view. In IEEE International 
Conference on Computer Vision, pages 975-982, 1999. 

15. L. R. Williams and D. W. Jacobs. Local parallel computation of stochastic com- 
pletion fields. Neural Computation, 9(4):859-882, 1997. 

16. L. R. Williams and D. W. Jacobs. Stochastic completion fields: A neural model of 
illusory contour shape and salience. Neural Computation, 9(4):837-858, 1997. 




Alignment-Based Recognition of Shape Outlines 
Abstract. We present a 2D shape recognition and classification method 
based on matching shape outlines. The correspondence between outlines 
(curves) is based on a notion of an alignment curve and on a mea- 
sure of similarity between the intrinsic properties of the curve, namely, 
length and curvature, and is found by an efficient dynamic-programming 
method. The correspondence is used to find a similarity measure which 
is used in a recognition system. We explore the strengths and weaknesses 
of the outline-based representation by examining the effectiveness of the 
recognition system on a variety of examples. 



1 Introduction 



The representation of the shape of objects can have a significant impact on the 
effectiveness of a recognition strategy. Shapes have been represented as curves, 
point sets, feature sets, and by medial axis, among others. This paper develops 
an approach to object recognition based on a curve-based representation of shape 
outline using the proposed concept of an alignment curve, and identifies the 
strengths and weaknesses of using curves to represent shapes for object 
recognition and for indexing into image databases by shape context. 

In many object recognition and content-based image indexing applications, 
the object outlines are represented as curves and matched. The matching relies 
either on aligning feature points using an optimal similarity transformation 
or on a deformation-based approach to aligning the properties of the two 
curves. Transformation-based methods rely on matching feature points by finding 
the optimal rotation, translation, and scaling parameters. Deformation-based 
methods typically involve finding a mapping from one curve to the other that 
minimizes an “elastic” performance functional, which penalizes “stretching” and 
“bending”. The minimization problem in the discrete domain is transformed into 
one of matching shape signatures with curvature, bending angle, or orientation 
as attributes. The curve-based methods typically suffer from one or more of the 
following drawbacks: asymmetric treatment of the two curves, sensitivity to 
sampling, lack of rotation and scaling invariance, and sensitivity to 
articulations and deformations of parts. We address some of these issues in the 
proposed method. 

Another type of shape representation models the shape outline as point sets 
and matches the points using an assignment algorithm. Gold et al. use graduated 
assignment to match image boundary features. In a recent approach, Belongie 
et al. use the Hungarian method to match unordered boundary points, 
using a coarse histogram of the relative location of the remaining points as the 
feature. These methods have the advantage of not requiring ordered boundary 
points, but do not necessarily preserve the coherence of shapes in matching. 

Shapes have also been represented by medial axis or its variants and then 
matched. Shock graph matching has been used for object recognition and image 
indexing tasks. Zhu and Yuille have proposed a framework (FORMS) for matching 
animate shapes by comparing their skeletal graphs. These approaches do not 
explicitly model the instabilities of the symmetry-based representations, which 
can be problematic when dealing with visual transformations like occlusion, 
view-point variation, and articulation. Liu and Geiger use the A* algorithm to 
match shape axis trees. Their algorithm does not preserve the ordering of edges 
at nodes, which can result in matches that do not preserve the coherence of the 
shapes. Klein et al. have recently proposed an edit-distance based approach to 
shape matching, which is very effective, but like other graph matching 
techniques can in general be computationally intensive. This gives rise to the 
question of whether the additional effort required in skeletal matching is 
justified by the improvements in recognition rates for particular applications. 
A goal of this paper is to examine the effectiveness of outline-based 
matching techniques in general. 

In this paper, we present an outline-based recognition method, which relies on 
finding the optimal correspondence between 2D outlines (curves) by comparing 
their intrinsic properties, namely, length and curvature. The basic premise of the 
approach is that the goodness of the optimal correspondence can be expressed 
as the sum of the goodness of matching subsegments. This allows us to cast 
the problem of finding the optimal correspondence as an energy minimization 
problem, which is solved by an efficient dynamic-programming algorithm. We 
introduce the notion of an alignment curve to ensure a symmetric treatment of 
the two curves being matched. The problem formulation and the mathematics 
underlying the matching process are described in Section 2. In Section 3 we 
discuss the proposed curve matching framework in application to shape 
classification and handwritten character recognition. In Section 4 we discuss 
some of the shortcomings and limitations of curve-based representation for 
recognition. 



2 Curve Matching 

This section first discusses the problem of matching and aligning two curve 
segments, followed by a discussion pertaining to closed curves. Denote the curve 
segments to be matched by $C(s) = (x(s), y(s))$, $s \in [0, L]$, and 
$\bar{C}(\bar{s}) = (\bar{x}(\bar{s}), \bar{y}(\bar{s}))$, $\bar{s} \in [0, \bar{L}]$, 
where $s$ is arc length, $x$ and $y$ are coordinates of each point, $L$ is length, 
and where each is similarly defined for $\bar{C}$. A central premise of this 
approach is that the “goodness” of the overall optimal match is the sum of the 
“goodness” of the optimal matches between corresponding subsegments. This allows 
an energy functional to convey the goodness of a match as a function of the 
correspondence or alignment of the two curves, as proposed earlier. Let 
a mapping $g : [0, L] \to [0, \bar{L}]$, $g(s) = \bar{s}$, represent an alignment 
of the two curves. 

Fig. 1. (a) The cost of deforming an infinitesimal segment $AB$ to segment 
$\bar{A}\bar{B}$, when the initial points and the initial tangents are aligned 
($A = \bar{A}$, $T_A = T_{\bar{A}}$), is related to the distance $B\bar{B}$, and is 
defined by $|d\bar{s} - ds| + R|d\bar{\theta} - d\theta|$. (b) The alignment curve 
allows for a finite segment from one curve to be aligned with a single point on 
the other curve, thus allowing for curve segment deletion or addition. 



Cohen et al. use “bending” and “stretching” energies in a physical analogy 
similar to the one used in formulating active contours, in the form of 

$$\mu[g] = \int_C \left| \frac{\partial \left( C(s) - \bar{C}(\bar{s}) \right)}{\partial s} \right|^2 ds + R \int_C \left( \kappa_C(s) - \kappa_{\bar{C}}(\bar{s}) \right)^2 ds,$$

where $\kappa$ is the curvature, $R$ is a parameter, and $\bar{s} = g(s)$. Younes [22] 
uses a similar functional. A key drawback of these approaches for recognition is 
that they are not invariant to the rotation of one curve with respect to the 
other, as the cost functional is a function of the absolute orientation of the 
curves. In addition, the issue of invariance to sampling has not been addressed. 
We now formulate the problem in an intrinsic manner which addresses both issues: 
Definition: Let $C|_{[s_1, s_2]}$ denote the portion of the curve from $s_1$ to 
$s_2$ and $g|_{([s_1, s_2], [\bar{s}_1, \bar{s}_2])}$ the restriction of the mapping 
$g$ to $[s_1, s_2]$, where $\bar{s}_1 = g(s_1)$ and $\bar{s}_2 = g(s_2)$. Define a 
measure $\mu$ on this alignment function, 

$$\mu[g]\big|_{([s_1, s_2], [\bar{s}_1, \bar{s}_2])} : g\big|_{([s_1, s_2], [\bar{s}_1, \bar{s}_2])} \to \mathbb{R}^{+},$$

constructed such that it is inversely proportional to the goodness of the match, 
i.e., it denotes the cost of deforming $C|_{[s_1, s_2]}$ to 
$\bar{C}|_{[\bar{s}_1, \bar{s}_2]}$. 

We restrict this measure $\mu$ to one which satisfies an additivity property, i.e., 

$$\mu[g]\big|_{([s_1, s_3], [\bar{s}_1, \bar{s}_3])} = \mu[g]\big|_{([s_1, s_2], [\bar{s}_1, \bar{s}_2])} + \mu[g]\big|_{([s_2, s_3], [\bar{s}_2, \bar{s}_3])},$$

where $\bar{s}_i = g(s_i)$. This property implies that the match process can be 
decomposed into a number of smaller matches, which in turn implies that it can 
be written as a functional 

$$\mu[g] = \int_0^L \mu[g]\big|_{([s, s+ds], [g(s), g(s+ds)])}.$$

Then, the optimal match is given by $g^* = \arg\min_g \mu[g]$. 

Consider two infinitesimal curve segments $C|_{[s_1, s_1 + ds]}$ and 
$\bar{C}|_{[\bar{s}_1, \bar{s}_1 + d\bar{s}]}$ of lengths $ds$, $d\bar{s}$, and 
curvatures $\kappa$, $\bar{\kappa}$, respectively. In our approach we only compare 
the intrinsic aspects of the curves. Thus, we can align the curves such that the 
points $A$ and $\bar{A}$, and their tangents $T_A$ and $T_{\bar{A}}$ coincide, 
Figure 1(a). The cost of matching the infinitesimal curve segments is the degree 
by which $B$ and $\bar{B}$ and their respective tangents differ, namely, 

$$\mu[g]\big|_{([s_1, s_1 + ds], [\bar{s}_1, \bar{s}_1 + d\bar{s}])} = |d\bar{s} - ds| + R\,|d\bar{\theta} - d\theta|, \tag{1}$$

where $R$ is a constant. Then, the resulting functional is given by 

$$\mu[g] = \int_C \left[ \left| \frac{d\bar{s}}{ds} - 1 \right| + R \left| \frac{d\bar{\theta}(\bar{s})}{d\bar{s}} \frac{d\bar{s}}{ds} - \frac{d\theta(s)}{ds} \right| \right] ds. \tag{2}$$
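
A discrete reading of the infinitesimal cost in Equation 1 can be sketched as follows. This is an illustration only: the function names, the polyline sampling, and the angle-wrapping convention are assumptions, and the default R = 10 is the value the paper reports using in its examples.

```python
import math

def segment_cost(ds, dtheta, ds_bar, dtheta_bar, R=10.0):
    """Cost of matching one short segment of each curve:
    |ds_bar - ds| + R * |dtheta_bar - dtheta|."""
    return abs(ds_bar - ds) + R * abs(dtheta_bar - dtheta)

def intrinsic_samples(points):
    """Segment lengths and orientation changes of an open polyline.

    The first orientation change is set to 0 because the first segment
    has no predecessor.
    """
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lengths.append(math.hypot(x1 - x0, y1 - y0))
        angles.append(math.atan2(y1 - y0, x1 - x0))
    dthetas = [0.0]
    for a0, a1 in zip(angles, angles[1:]):
        # Wrap the orientation difference into [-pi, pi).
        dthetas.append((a1 - a0 + math.pi) % (2 * math.pi) - math.pi)
    return lengths, dthetas
```

Because only lengths and orientation changes enter the cost, rotating or translating either polyline leaves every `segment_cost` value unchanged, which is the invariance the intrinsic formulation is designed to provide.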



The functional penalizes “stretching” and “bending”. However, this formulation 
of the curve matching problem is inherently asymmetric. This is precisely the 
objection raised by Tagare et al. to algorithms which are based on a 
differentiable function of one curve to the other. They instead propose a 
“bimorphism”, which diffeomorphically maps a pair of curves to be matched, and 
which corresponds to a closed curve in the space $C_1 \times C_2$. Specifically, 
they formulate a cost function that minimizes differences in local orientation 
change $|d\bar{\theta} - d\theta|$ along each differential segment of this curve, 
and seek a pair of functions $\phi_1$ and $\phi_2$, elements of the bimorphism, 
which optimize this cost functional. 

We approach this asymmetry issue in a somewhat similar fashion. Observe 
that the formulation allows for mapping an entire segment of the first curve 
to a single point in the second curve, but it is not possible to map a single 
point in the first curve to a segment in the second curve. This is because the 
notion of an alignment is captured by a (uni-valued) function $g$. To alleviate 
this difficulty we adopt a view where an alignment between two curves is 
represented as a pairing of two particles, one on each curve traversing their 
respective paths monotonically, but with finite stops allowed. Let the alignment 
be specified in terms of two functions $h$ and $\bar{h}$ relating arc length along 
$C$ and $\bar{C}$ to the newly defined curve parameter $\xi$, i.e., $s = h(\xi)$ and 
$\bar{s} = \bar{h}(\xi)$. In cases where $h$ is invertible, we have 
$\bar{s} = \bar{h}(h^{-1}(s)) = \bar{h} \circ h^{-1}(s)$, which allows for the use 
of an alignment function, $g = \bar{h} \circ h^{-1}$, as before. However, when $h$ 
is not invertible, i.e., when the first particle stops along the first curve for 
some finite time, $g$ is not defined. While this formulation allows for a 
symmetric treatment of the curves, note that a superfluous degree of freedom is 
introduced, as in the approach of Tagare et al., because different traversals 
$h$ and $\bar{h}$ may give rise to the same alignment. While Tagare et al. treat 
this degree of redundancy in the optimization involving two functions, we remove 
this additional degree of redundancy by proposing the notion of an alignment 
curve, $\alpha$, with coordinates $h$ and $\bar{h}$, 

$$\alpha(\xi) = (h(\xi), \bar{h}(\xi)), \quad \xi \in [0, \tilde{L}], \quad \alpha(0) = (0, 0), \quad \alpha(\tilde{L}) = (L, \bar{L}),$$

where $\xi$ is the arc length along the alignment curve and $\tilde{L}$ is its 
length. The alignment curve can now be specified by a single function, namely, 
$\psi(\xi)$, $\xi \in [0, \tilde{L}]$, where $\psi$ denotes the angle between the 
tangent to the curve and the x-axis, Figure 1(b). The coordinates $h$ and 
$\bar{h}$ can then be obtained by integration, 

$$h(\xi) = \int_0^{\xi} \cos(\psi(\eta))\, d\eta, \qquad \bar{h}(\xi) = \int_0^{\xi} \sin(\psi(\eta))\, d\eta, \qquad \xi \in [0, \tilde{L}].$$

Fig. 2. (a) This figure illustrates the template that is used in the dynamic-
programming implementation of Equation 4. The entry $d(i, j)$ at $(i, j)$ is the 
cost to match the curve segments $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$. 
To update the cost at $(i, j)$ (blue dot) we limit the choices of $k$ and $l$, so 
that only costs at a limited set of points (green dots) are considered. (b) This 
figure illustrates the grid used by the dynamic-programming method to compute 
the optimal alignment curve for closed curves. Discrete samples along the curves 
are the axes. The first curve $C$ is repeated. If the blue curves are optimal 
alignment curves from $(s_i, \bar{s}_1)$ to $(s_i + n - 1, \bar{s}_m)$ and from 
$(s_j, \bar{s}_1)$ to $(s_j + n - 1, \bar{s}_m)$, then the alignment curve from 
$(s_k, \bar{s}_1)$ to $(s_k + n - 1, \bar{s}_m)$ for $i < k < j$ does not cross the 
blue lines, so the search can be restricted to the green area. Full details are 
discussed elsewhere. 



Note that $\psi$ is constrained by monotonicity ($h' \ge 0$ and $\bar{h}' \ge 0$) to 
lie in $[0, \pi/2]$. The alignment between $C$ and $\bar{C}$ is then fully 
represented by a single function $\psi$. 
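
To make the role of the angle function concrete, the sketch below integrates a sampled angle function with a simple rectangle rule to recover the two coordinate functions of the alignment curve. The function name, the uniform step `d_xi`, and the rectangle rule are assumptions made for illustration; the paper does not specify a discretisation at this point.

```python
import math

def alignment_from_angles(psi, d_xi):
    """Integrate sampled angles psi (radians, each in [0, pi/2]) with a
    rectangle rule. Returns (h, h_bar), the arc-length positions on the
    first and second curve at each sample boundary."""
    h, h_bar = [0.0], [0.0]
    for angle in psi:
        if not 0.0 <= angle <= math.pi / 2:
            raise ValueError("psi must lie in [0, pi/2] to keep h and h_bar monotonic")
        h.append(h[-1] + math.cos(angle) * d_xi)
        h_bar.append(h_bar[-1] + math.sin(angle) * d_xi)
    return h, h_bar
```

An angle of 0 advances only along the first curve, an angle of pi/2 advances only along the second (the first particle "stops"), and intermediate angles advance along both.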

The goodness of the match corresponding to the alignment curve can now 
be rewritten in terms of $\psi$. First, if $h' \neq 0$ and $\bar{h}' \neq 0$ for 
$\xi \in [\xi_1, \xi_2]$, then $g = \bar{h} \circ h^{-1}$ is well defined and we 
rewrite $\mu[\psi]$ in terms of $\mu[g]$ using Equation 2, which after some 
simplification results in 

$$\mu[\psi]\big|_{[\xi_1, \xi_2]} = \int_{\xi_1}^{\xi_2} \Big[ |\cos(\psi) - \sin(\psi)| + R\, |\kappa(h) \cos(\psi) - \bar{\kappa}(\bar{h}) \sin(\psi)| \Big]\, d\xi. \tag{3}$$
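
The simplification leading to Equation 3 can be sketched as follows. This is a reconstruction of the intermediate steps rather than the authors' own derivation; it assumes the relations $ds = \cos\psi\, d\xi$ and $d\bar{s} = \sin\psi\, d\xi$ that follow from the definition of the alignment curve, writes $\kappa = d\theta/ds$ and $\bar{\kappa} = d\bar{\theta}/d\bar{s}$, and uses $ds \ge 0$ to move $ds$ inside the absolute values.

```latex
\begin{align*}
\left| \frac{d\bar{s}}{ds} - 1 \right| ds
  &= |d\bar{s} - ds|
   = |\sin\psi - \cos\psi|\, d\xi, \\
R \left| \bar{\kappa}\, \frac{d\bar{s}}{ds} - \kappa \right| ds
  &= R\, |\bar{\kappa}\, d\bar{s} - \kappa\, ds|
   = R\, |\bar{\kappa}(\bar{h}) \sin\psi - \kappa(h) \cos\psi|\, d\xi .
\end{align*}
```

Summing the two terms and integrating over $\xi$ gives the stretching and bending parts of the integrand in Equation 3.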



Second, consider that one of $h'$ or $\bar{h}'$ is zero at a point, say 
$h'(\xi) = 0$, implying that this point maps to a corresponding interval 
$[\bar{h}(\xi), \bar{h}(\xi + d\xi)]$. The cost of mapping the point $h(\xi)$ to the 
interval $[\bar{h}(\xi), \bar{h}(\xi + d\xi)]$ is defined by enforcing continuity of 
the cost with deformations: consider the cost of aligning the interval 
$[h(\xi), h(\xi + d\xi)]$ to the interval $[\bar{h}(\xi), \bar{h}(\xi + d\xi)]$ as the 
first interval shrinks to a point, i.e., as $\psi \to \pi/2$ or $\cos(\psi) \to 0$. 
Similarly, the case where an interval in the first curve is mapped to a point in 
the second curve should be the limiting case of $\psi \to 0$ or 
$\sin(\psi) \to 0$. Thus, the overall cost of the alignment $\psi$ is well defined 

Fig. 3. Examples of the optimal alignment between curves obtained using the curve 
matching algorithm. The alignment is indicated by arbitrarily coloring portions of the 
aligned curves by identical colors with a number indicating each portion’s end 
point. Observe that the alignment is intuitive for both open and closed curves. 



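in all cases of Equation 3 and is found by minimizing 

$$\mu[\psi] = \int_0^{\tilde{L}} \Big[ |\cos(\psi) - \sin(\psi)| + R\, |\kappa(h) \cos(\psi) - \bar{\kappa}(\bar{h}) \sin(\psi)| \Big]\, d\xi, \tag{4}$$

$$0 \le \psi \le \frac{\pi}{2}, \qquad \int_0^{\tilde{L}} \cos(\psi)\, d\xi = L, \qquad \int_0^{\tilde{L}} \sin(\psi)\, d\xi = \bar{L}.$$

Then, the optimal alignment is given by $\psi^* = \arg\min_{\psi} \mu[\psi]$. 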

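Definition: Let the edit distance between two curve segments $C$ and $\bar{C}$ be 
defined as the cost of the optimal alignment of the two curves, 
$d(C, \bar{C}) = \mu[\psi^*]$. It is straightforward to show the following. 

Lemma 1. If $h^*$ and $\bar{h}^*$ specify the optimal alignment given by $\psi^*$, 
the distance function satisfies the following suboptimal property for 
$\xi_1 < \xi_2 < \xi_3$, $s_i = h^*(\xi_i)$, $\bar{s}_i = \bar{h}^*(\xi_i)$, $i = 1, 2, 3$, 

$$d\big(C|_{[s_1, s_3]}, \bar{C}|_{[\bar{s}_1, \bar{s}_3]}\big) = d\big(C|_{[s_1, s_2]}, \bar{C}|_{[\bar{s}_1, \bar{s}_2]}\big) + d\big(C|_{[s_2, s_3]}, \bar{C}|_{[\bar{s}_2, \bar{s}_3]}\big).$$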

Matching Closed Curves: The edit distance between two closed curves is the 
minimum cost of matching the open curve segments starting at $s_1$ and 
$\bar{s}_1$ and terminating at the same points after having traversed the entire 
curves, 

$$d(C_{closed}, \bar{C}_{closed}) = \min_{s_1, \bar{s}_1} d\big(C|_{[s_1, s_1 + L]}, \bar{C}|_{[\bar{s}_1, \bar{s}_1 + \bar{L}]}\big).$$


When matching closed curves, we do not have to find the alignment for all pairs 
of start point correspondences. It is sufficient to choose a start point $s_1$ on 
curve $C$, and then find the optimal alignments for all possible start points on 
the curve $\bar{C}$. If we choose another point $s_2$ instead of $s_1$, we will get 
the same optimal alignment using Lemma 1. 
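
The idea of fixing a start point on one curve and trying every start point on the other can be sketched by brute force. This is not the paper's restricted-search method: the sample encoding (a list of `(ds, dtheta)` pairs per curve), the three-move dynamic program, and the function names are assumptions, and the sketch costs one full open-curve match per cyclic shift.

```python
def open_cost(a, b, R=10.0):
    """Edit-style cost between two open curves given as lists of
    (ds, dtheta) samples, using match / skip-in-a / skip-in-b moves."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # match sample i-1 of a with sample j-1 of b
                step = abs(b[j - 1][0] - a[i - 1][0]) + R * abs(b[j - 1][1] - a[i - 1][1])
                d[i][j] = min(d[i][j], d[i - 1][j - 1] + step)
            if i:  # sample of a collapses onto a point of b
                d[i][j] = min(d[i][j], d[i - 1][j] + abs(a[i - 1][0]) + R * abs(a[i - 1][1]))
            if j:  # sample of b collapses onto a point of a
                d[i][j] = min(d[i][j], d[i][j - 1] + abs(b[j - 1][0]) + R * abs(b[j - 1][1]))
    return d[n][m]

def closed_cost(a, b, R=10.0):
    """Fix the start point of closed curve a and try every start point of b."""
    return min(open_cost(a, b[k:] + b[:k], R) for k in range(len(b)))
```

With this encoding, a closed curve and a copy of it that merely starts at a different sample have `closed_cost` equal to zero, whereas `open_cost` on the unshifted lists is generally positive.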

The curve matching is implemented using a fast¹ dynamic-programming 
method, as outlined in Figure 2 and described in detail elsewhere. Figure 3 
illustrates the alignment for two simple cases. In all the examples, we set R = 10. 

¹ The complexity of the algorithm to match curve segments and closed curves is O(n²) 
and O(n² log(n)), respectively, where n is the number of samples along the curves. 
It takes 0.04 secs and 1.6 secs to match curve segments and closed curves with 50 
samples respectively on an SGI Indigo2 (195 MHz). 
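
A simplified dynamic program in the spirit of the method outlined above is sketched here. It is not the authors' implementation: it uses only three moves per grid cell rather than the limited template of Figure 2, the per-sample encoding by length and orientation change is an assumption, and the function name is hypothetical. It does, however, visit each of the (n+1) x (m+1) grid cells once, in line with the quadratic cost stated in the footnote for open curves.

```python
def align_open_curves(ds, dtheta, ds_bar, dtheta_bar, R=10.0):
    """Return (cost, path) for two open curves sampled as per-segment
    lengths and orientation changes. path lists grid nodes (i, j) from
    (0, 0) to (n, m); a horizontal or vertical step means a segment of one
    curve is aligned with a single point of the other."""
    n, m = len(ds), len(ds_bar)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:  # match segment i-1 with segment j-1
                step = abs(ds_bar[j - 1] - ds[i - 1]) + R * abs(dtheta_bar[j - 1] - dtheta[i - 1])
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:  # segment i-1 of the first curve maps to a point
                step = abs(ds[i - 1]) + R * abs(dtheta[i - 1])
                candidates.append((cost[i - 1][j] + step, (i - 1, j)))
            if j > 0:  # segment j-1 of the second curve maps to a point
                step = abs(ds_bar[j - 1]) + R * abs(dtheta_bar[j - 1])
                candidates.append((cost[i][j - 1] + step, (i, j - 1)))
            if candidates:
                cost[i][j], back[i][j] = min(candidates)
    # Walk the back-pointers from the end of both curves to the start.
    path, node = [], (n, m)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    return cost[n][m], path[::-1]
```

Swapping the two curves swaps the roles of the horizontal and vertical moves but leaves the minimum cost unchanged, which mirrors the symmetric treatment that the alignment curve is intended to give.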

Fig. 4. This figure illustrates the performance of the curve matching algorithm in the 
presence of an affine transformation (a), view-point variation (b), and articulation and 
deformation of parts (c). The alignment is indicated by arbitrarily coloring portions 
of the aligned curves by identical colors with a number indicating each portion’s 
end point. Observe that the matches are intuitive, e.g., hands, legs and head of the 
dolls correspond in the presence of articulation, stretching and bending. Note that the 
different views of the kangaroo were obtained by taking snapshots of a 3D model. 



We have also seen experimentally that the alignment is relatively insensitive to 
the choice of R. 



3 Recognition Using Shape Outline Alignment 

In this section, we examine the effectiveness of curve matching for recognizing 
shape outlines and characters. The curve alignment framework gives a corre- 
spondence between two curves, which is then used to measure the similarity 
between two curves. One can either use edit distance or normalized Euclidean 
distance between corresponding points. For curve matching to be effective 
in object recognition, it has to perform well under a variety of visual 
transformations such as occlusion, articulation and deformation of parts, and 
view-point variation, which we examine now. Figure 4 shows that the curve 
matching algorithm works well in the presence of commonly occurring visual 
transformations: affine transformations, modest amounts of view-point variation, 
and some articulation and deformations like stretching and bending of parts. 

Object Recognition: We illustrate the use of curve matching for shape clas- 
sification on a database of 36 shapes. The database consists of shapes from six 
different categories (fishes, tools, planes, rabbits, “greebles”, and hands) with six 
different shapes in each category. Comparisons are made between every pair of 
shapes. The five nearest neighbors for each shape are highlighted. Observe that 

Table 1. Costs of matching pairs of shape outlines in a database of 36 shapes, 6 samples 
of each of 6 categories. The first through fifth nearest neighbors are from the “correct” 
category in 36/36, 35/36, 35/36, 33/36, and 27/36 cases, respectively. 

the shapes are categorized intuitively, i.e., the nearest neighbors of the “tool” 
shapes are in the “tool” category and similarly for others. 
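
The per-rank tallies reported for the shape database can be computed from a pairwise cost matrix as sketched below. The function name and the input layout (a square list-of-lists cost matrix and a parallel list of category labels) are assumptions made for illustration.

```python
def neighbor_hits(dist, labels, k=5):
    """hits[r] counts the shapes whose (r+1)-th nearest neighbour, excluding
    the shape itself, belongs to the same category."""
    hits = [0] * k
    for i, row in enumerate(dist):
        # Rank every other shape by its matching cost to shape i.
        order = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        for rank, j in enumerate(order[:k]):
            if labels[j] == labels[i]:
                hits[rank] += 1
    return hits
```

With six shapes per category, each shape has exactly five same-category candidates, so k = 5 is the largest rank at which a perfect score is possible.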

Handwritten character recognition: As another example we have selected 
handwritten character recognition which due to its inherently one-dimensional 
nature, is well suited to this approach. As in the case of recognizing shape out- 
lines, the optimal alignment between two characters is found and then used to 
compute a distance measure between the two. We have used a database of 88 
digits consisting of 6 different characters, to perform recognition experiments. 
Matching is done between every pair of characters in the database, and the top 
25 matches of a few sample characters are shown in Table 2. Observe that in 

Table 2. The top 20 matches for a few handwritten digits. The database used in this 
experiment consists of 88 handwritten digits. The number below the matching character 
is the computed distance. Observe that most of the top matches of a character are 
samples of the same character, i.e., the top matches of “2” are samples of “2”. 

Table 3. The top five matches for a few sample characters are shown. The number 
below each match is the computed distance measure between the two characters. 

most cases, the top matches are other samples of the same character, indicating 
the potential of this approach in handwritten character recognition. 

Prototype formation: Prototypes have been used to improve the efficiency of 
object and character recognition and indexing into image databases. Typically, 
a representative sample is used as the prototype. Instead, an “average” 
curve can be used. The curve alignment framework allows us to generate the 
average of a set of curves. Figure 5 shows the average curve for a set of fish 
outlines and for handwritten digits. The average outlines of handwritten 
characters are used in the handwritten character recognition experiments with 
excellent results. For 327 digits and alphabets (34 categories) written by one 
subject, 323 characters (98.8%) were correctly recognized. The top five matches 
for a few sample characters are shown in Table 3. 

Morphing: Morphing a shape to another has a variety of applications in com- 
puter graphics and animation. The proposed curve matching framework can be 
used to generate a sequence of images morphing a shape to another when the 
shapes are not very dissimilar. Figure 5 shows the morphing of the outline of 
a cat to that of a kangaroo. Curve matching has also been used in a variety 
of other applications including tracking objects in a video sequence, comparing 

Fig. 5. Top and middle rows: A collection of curves (blue) and their average (red). 
Bottom row: Sequence of deforming the outline of a cat to that of a kangaroo. 




Fig. 6. (a) Curves do not represent the interior of the shape, and hence cannot 
adequately distinguish between perceptually distinct shapes whose outlines have 
similar features. (b) This also implies that in relating two shapes by curve 
matching, outline features take precedence over matching the shape interior. Curve 
matching aligns the wavy sides of the two squares, ignoring the spatial 
configuration as a square. 



medical structures, and registering 3D volume datasets by aligning characteristic 3D 
space curves like ridges. Thus, our proposed scheme can be useful in numerous 
applications. 
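
Both prototype formation and morphing rely on the point correspondence delivered by the alignment. A minimal sketch of how such a correspondence might be used is given below; it assumes the two curves are already expressed in a common coordinate frame, that the correspondence is supplied as a list of index pairs, and that corresponding points are blended linearly. The paper does not state its averaging or morphing rule in this section, so these choices and the function name are illustrative assumptions.

```python
def blend_curves(curve_a, curve_b, pairs, t=0.5):
    """Blend corresponding points of two curves.

    curve_a, curve_b: lists of (x, y) points.
    pairs: list of (i, j) index pairs, point i of curve_a matched to point j
           of curve_b.
    t: 0 returns the matched points of curve_a, 1 those of curve_b, and 0.5
       their pointwise average.
    """
    blended = []
    for i, j in pairs:
        xa, ya = curve_a[i]
        xb, yb = curve_b[j]
        blended.append(((1 - t) * xa + t * xb, (1 - t) * ya + t * yb))
    return blended
```

Calling the function with t = 0.5 gives a candidate average curve for two samples, while stepping t from 0 to 1 yields a sequence of intermediate outlines of the kind shown in the bottom row of Figure 5.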



4 Discussion and Conclusion 

We have presented a computational framework to find the optimal correspon- 
dence between two 2D curves. The main contribution of this paper is to propose 
a new scheme for curve matching that is symmetric in its treatment of the 
two curves, is highly efficient, and works well in a variety of computer vision 
applications including shape classification, hand-written character recognition, 
prototype formation, and morphing. The optimal correspondence is computed 
by using the concept of an alignment curve and due to the use of intrinsic prop- 
erties is invariant to rotations and translations and gives the intuitive matches 
in the presence of visual transformations like viewpoint variation, articulation 
and occlusions of limited extent. 

Fig. 7. This figure illustrates the sensitivity of the curve matching to spatial 
arrangement of parts (a, b, c, d) and to occlusion (e). Top row: We perceive two ellipses 
with protrusions. The larger protrusion is matched correctly, as it lies on the same side 
of the ellipse. However, the smaller protrusion is matched incorrectly as it lies on 
opposite sides of the ellipse. Middle row: The correspondence of the fishes in (b) is 
incorrect, as a fin on the fish on the left is matched to the head of the fish on the 
right. This incorrect match is because there is an extra fin in the fish on the left. 
(c) illustrates that the correspondence is intuitive when the fin on the correct side 
is removed. Bottom row: The missing finger and the small bump of the hand on the right 
cause the curve matching to give the un-intuitive match (d). (e) shows a case where 
curve matching gives the un-intuitive correspondence for similar shapes in the presence 
of occlusion. Part of the tail of the fish on the right (shown by the box) is occluded 
in this case. 



We have studied the effectiveness of curve matching for shape matching and 
classification, especially in comparison to our group’s work on shock graph-based 
methods, and evaluated its strengths and weaknesses. While the full comparison 
is beyond the scope of this paper, we summarize the main points of difference 
below. The major advantage of curve matching is its computational 
efficiency. We have shown that with our proposed curve matching method we 
can achieve acceptable recognition rates for shape matching even under a range 
of visual transformations while maintaining computational efficiency. However, 
we have identified a number of areas where curve matching fails for 2D shape 
recognition. The first shortcoming of curve-based representations is that they 
do not represent the interior of the shape. Hence, curve matching cannot easily 
distinguish between some perceptually distinct shapes when the local curve-based 
features are in conflict with the global shape percept, Figure 6. Another 
drawback of curve representation and hence curve matching is the sensitivity to 
the presence and spatial arrangement of parts. Figure 7 shows examples where 
curve matching gives the un-intuitive correspondence when the parts are arranged 
differently. Curve matching works well in the presence of occlusion, if it does 
not affect the overall part structure of the object. When the occlusion adds or 
deletes a part, curve matching can fail, as shown in Figure 7(e). 

In conclusion, curve-based representation is the natural choice in handwritten 
character recognition and in other applications where the data is inherently one- 
dimensional. Also, for shape recognition, prototype formation and morphing 
where the variation in shape does not alter the part structure, curve matching 
works well. However, in the presence of large scale variations of the outline 
resulting in changes in the part structure, curve matching can fail, and more 
comprehensive representations which explicitly represent the shape interior, such 
as skeletal graphs are necessary despite their relatively higher computational 
cost. 

Abstract. The paper describes a method for analysing a sequence of images 
by building static images that represent the environment in which shapes 
move. From the background and the moving objects it is possible to reconstruct 
the original image sequence as well as to generate new ones. The analysis 
uses a linguistic interface that makes it possible to express the semantics of 
videos. Background and movement analysis together make it possible to extract 
the shapes contained in a video. The description of video shapes and of their 
spatio-temporal properties is performed by a Prolog program: the facts of the 
program describe the syntax of the videos, while the layout of the 
predicates contains the description of the semantics. The content of a 
‘video-base’ may then be extracted: the approach uses a prototype film, whose 
description is used as a dynamic query for automatic extraction of other 
film semantics. 



1 Introduction 

Visual representation is characterised by richness of semantics [2], and filmic 
representation in particular corresponds to a long and complex linguistic 
description. The content of videos stored in large databases is difficult to 
retrieve [4]. Retrieval is hard when the images are static and becomes a 
challenge in a dynamic context, when the number of images runs to several 
hundred. In this case the information involved is complex and access to it is 
normally hard and slow. Various researchers have therefore attempted to extract 
the content from a sequence of frames. The aim of such research is to associate 
with each film a global description that constitutes a way to index [15] it in a 
database. The problem of building an interface for multimedia content 
description is related to the MPEG standards. In fact the Moving Picture Experts 
Group [13] is currently working on a new standard, named MPEG-7 [14], which 
deals with a 'multimedia content description interface'. This standard concerns 
especially multimedia descriptions, description schemes, and description 
definition languages. A series of requirements arises from this standard, such 
as the following: a description must suit both the user and the application; 
different people may produce different analyses of the same scene, and may give 
different interpretations of the same event. A description has a proper 
granularity: too general a description results in loss of detail, while too 
detailed a description loses important general concepts. Moreover, some details 
may interest some people and not others. The MPEG-7 document does not include 
an automatic extraction of descriptions and says nothing about a browser for 
showing these descriptions. The paper focuses on two aspects of the interface. 
Firstly, it takes care of the extraction of a compact representation [7] of a 
frame series and of a description based on a small number of features [16]. 
Secondly, it defines a query language constituted by a set of predicates and 
facts; the latter are derived from a series of features automatically extracted 
from the frame sequence. 



2 Background-Based Video Analysis 

Mosaicing and movement analysis make it possible to extract all the Video Objects (VO) 
contained in a film. The description of the VOs and of their movements is produced by 
automatically generating a Prolog program: the facts of the program describe the syntax 
of the videos, while the predicates describe their semantics. A dynamic shape can be 
referred to the background, and a mobile point of view can be defined that simulates 
the TV-camera movement. A background may be built by placing different frames side by 
side so as to form a panoramic scene [10]. The union of the frames reconstructs the 
environment in which the film was shot. The information in a video is redundant, 
because each point of the scene is shown repeatedly in consecutive frames; the data 
shared by all frames constitute the data of the global scene. Mosaicing transforms a 
frame-based representation into a background-based representation. The spatial and the 
temporal information constitute the basis of the analysis. Moreover, while the visual 
field of a single image is restricted, the camera movement extends the field that will 
be present in the background. 

Since the background carries the spatial information, the moving objects must be 
removed from the scene when we search for spatial information, because they cover the 
background. Some objects may hide parts of the background during the film, leaving the 
mosaic incomplete. Conversely, in some frames the covered background becomes visible 
and is then promptly captured. Similar considerations hold for temporal information 
that is not linked to movement. The camera movement is fundamental for joining frames. 
The assumptions for movement analysis are the following: 
- the input is constituted by a series of frames; 
- each frame represents an image of the scene at a given instant; 
- each image is a projection of the scene depending on the camera position [19]; 
- the change in a scene is due to the camera movement and to the displacement of 
shapes, which are almost rigid. 
The system detects the changes in a scene and from these derives the movement of the 
observer and of the shapes. There are three dynamic situations: 1) Stationary Camera, 
Moving Object (SCMO); 2) Moving Camera, Stationary Object (MCSO); 3) Moving Camera, 
Moving Object (MCMO). SCMO is suited to surveillance applications, while the other two 
are more versatile and are used in navigation systems and, recently, in video indexing. 
The literature also subdivides dynamic scene analysis into two steps [12]: perceptive 
and cognitive. 



3 Background Extraction 

Background extraction requires four steps: 1) movement detection; 2) definition of the 
movement model and computation of the dynamic parameters; 3) cancellation of the 
objects that move with respect to the background; 4) alignment and building of the 
background. 

Movement detection. Movement detection using gradients is accurate only if the 
displacement between frames is a fraction of a pixel, so that the implicit Taylor 
approximation is acceptable. Moreover, the derivatives become critical in the case of 
fast movement, and a pyramidal multi-resolution scheme is often necessary [1]. Methods 
based on features [16, 18] often depend on the type of image: for instance, the corner 
method is ill-suited to smooth and circular shapes. In addition, these methods often 
use correlation for correspondence detection [5]. For these reasons we have chosen a 
correlation method for movement detection. We divide the image into blocks and analyse 
the movement through the correlation between blocks; in particular we use a variant of 
the BMA (block matching algorithm) [13]. The correlation is computed inside a window 
that includes the block, and this solution overcomes the speed problem: the admissible 
speed of the motion grows with the window size. Moreover, the BMA algorithm allows us 
to compare images that are 3 or 4 frames apart. The reference blocks are arranged on a 
grid, as in figure 1. 




Fig. 1. The reference-blocks. Each point represents the centre of a block. 

In this algorithm every point is a candidate for tracking; points with particular 
characteristics are not searched for, as they are in other approaches [17, 22]. The 
search method used is the FSA with the MSE similarity measure. In the example of 
figure 1 we consider blocks of 5x5 pixels and a maximum speed vector of 20 points in 
the vertical and in the horizontal direction. The movement vector is obtained by 
searching, with the MSE similarity measure, for the current block in a 20x20 window of 
the reference frame. The FSA algorithm is computationally simple and requires 40^2 
evaluations of the matching criterion. The MSE measure is computed with the formula: 
$$\mathrm{MSE}(u, v) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \bigl( I_n(k+i,\, l+j) - I_{n-1}(k+i+u,\, l+j+v) \bigr)^2$$

To match points of two images we search for the minimum of the MSE. In some cases the 
method gives ambiguous results; in general, points where the matching is not clear are 
rejected. Once the correspondences are established, the optical flow [8], that is the 
speed v of the blocks, is computed by the following formula: 

$$v \propto \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the co-ordinates of the same block in different 
frames. The optical flow is represented by vectors that depart from the source frame, 
with the orientation, direction and length needed to reach the corresponding block in 
the destination frame. 
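
The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FSA is read here as an exhaustive (full) search, the matching criterion is taken to be a sum of squared differences, and the names `block_error` and `full_search` as well as the toy frames are hypothetical and do not come from the system described in the paper.

```python
import math
import random

def block_error(cur, ref, k, l, u, v, size):
    """Sum of squared differences between the block of `cur` whose top-left
    corner is (k, l) and the block of `ref` displaced by (u, v)."""
    total = 0
    for i in range(size):
        for j in range(size):
            diff = cur[k + i][l + j] - ref[k + i + u][l + j + v]
            total += diff * diff
    return total

def full_search(cur, ref, k, l, size=5, max_disp=20):
    """Try every displacement within +/- max_disp and keep the best match."""
    rows, cols = len(ref), len(ref[0])
    best = None  # (u, v, error)
    for u in range(-max_disp, max_disp + 1):
        for v in range(-max_disp, max_disp + 1):
            # Ignore displacements that move the block outside the frame.
            if k + u < 0 or l + v < 0:
                continue
            if k + u + size > rows or l + v + size > cols:
                continue
            err = block_error(cur, ref, k, l, u, v, size)
            if best is None or err < best[2]:
                best = (u, v, err)
    return best

# Toy data (hypothetical): a 16x16 reference frame of pseudo-random grey
# levels, and a current frame whose content is the reference shifted by (2, -3).
rng = random.Random(0)
n = 16
ref = [[rng.randrange(256) for _ in range(n)] for _ in range(n)]
cur = [[ref[(r + 2) % n][(c - 3) % n] for c in range(n)] for r in range(n)]

u, v, err = full_search(cur, ref, k=6, l=6, size=5, max_disp=4)
speed = math.hypot(u, v)  # length of the displacement vector
```

The cost grows with the square of `max_disp`, which matches the remark that the number of evaluations of the matching criterion is fixed by the window size.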

Movement Model. The choice of the model implies using pre-defined knowledge, or 
models, to obtain the right interpretation. We adopt the affine model [20] with 6 
parameters. Three points yield six equations. The solutions may be influenced by: 
- the quantisation noise and the imperfect-matching noise; 
- the 2D nature of the affine model; 
- the presence of autonomously moving objects. 

If n is the number of corresponding points, we can write a system of 2n equations. 
Unfortunately, the problems listed above make these equations inconsistent. A solution 
is obtained by minimising the error between the real position in the destination frame 
and the position computed using the movement parameters. If $\mu$ is the mean of the 
error distribution and $\sigma$ its variance, the blocks that obey the formula 

$$Err > \mu + \sigma$$

are considered not to be background blocks. All points with an error higher than this 
value are discarded in the next calculation. The discarded points are considered either 
part of objects with their own movement, or points with a bad speed value due to the 
accumulation of errors. The algorithm detects the dominant movement and the areas with 
discordant movement. Figure 2 shows the computed optical flow. The circled points are 
those whose movement (south-east) is discordant with the dominant one (east). The 
system then computes the difference between the current and the previous frame. Since 
this operation compensates for the background movement, the result consists only of the 
discordant points. 
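
A minimal sketch of the estimation and rejection step follows. It assumes a least-squares fit of the displacement field dx = A + B*x + C*y, dy = D + E*x + F*y, and it takes the spread term in the threshold to be the standard deviation of the errors; both choices are assumptions made for illustration, and all function names and test data are hypothetical.

```python
import statistics

def solve3(m, b):
    """Solve the 3x3 linear system m x = b by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        out.append(det(mc) / d)
    return out

def fit_affine(points, flows):
    """Least-squares fit; returns [A, B, C, D, E, F]."""
    basis = [(1.0, x, y) for x, y in points]
    # Both components share the same 3x3 normal matrix.
    m = [[sum(p[i] * p[j] for p in basis) for j in range(3)] for i in range(3)]
    bx = [sum(p[i] * f[0] for p, f in zip(basis, flows)) for i in range(3)]
    by = [sum(p[i] * f[1] for p, f in zip(basis, flows)) for i in range(3)]
    return solve3(m, bx) + solve3(m, by)

def errors(params, points, flows):
    """Distance between the measured and the predicted displacement."""
    a, b, c, d, e, f = params
    return [((dx - (a + b * x + c * y)) ** 2
             + (dy - (d + e * x + f * y)) ** 2) ** 0.5
            for (x, y), (dx, dy) in zip(points, flows)]

def split_background(points, flows):
    """Fit, flag blocks with Err > mu + sigma, refit on the remaining ones."""
    errs = errors(fit_affine(points, flows), points, flows)
    mu = statistics.mean(errs)
    sigma = statistics.pstdev(errs)  # assumption: standard deviation
    outliers = [i for i, e in enumerate(errs) if e > mu + sigma]
    keep = [i for i in range(len(points)) if i not in outliers]
    params = fit_affine([points[i] for i in keep], [flows[i] for i in keep])
    return params, outliers

# Toy data (hypothetical): a 5x5 grid of block centres moving east, with one
# block in the middle moving south-east.
points = [(10.0 * x, 10.0 * y) for y in range(5) for x in range(5)]
flows = [(3.0, 0.0) for _ in points]
flows[12] = (3.0, 3.0)
params, outliers = split_background(points, flows)
```

After rejection the refit recovers a pure translation towards east, and block 12 is reported as discordant.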

Fig. 2. The optical flow. 

Fig. 3. Masked objects that move with respect to the background. 

Cancellation of objects with autonomous movement. Some methods of object 
cancellation [11, 21] attempt to group frame zones with coherent movements. Here we 
attempt: 
- to detect the background, which represents the environment where the scenes take 
place; 
- to detect the objects whose movement is independent of the background. 
The difference criterion used for this purpose by M. Irani, B. Rousso and S. Peleg [11] 
is not robust enough: zones without movement are also detected. In the method of J.Y.A. 
Wang and E.H. Adelson [21] the surfaces are detected, in the optical flow images, only 
by similarity. In the previous images we can see that some points are misclassified as 
background, for instance some points of the guardrail. We therefore classify as 
non-background the blocks detected by both methods: 
- by dissimilarity of their movement from that of the TV camera; 
- by difference between images, after realignment of the current frame with the 
previous one. 
A final computation groups blocks into a single object, using neighbourhood and 
connection criteria, since parts belonging to the same object sometimes have the same 
movement. We hypothesise that the points of the same object are neighbouring or 
connected; 8-connectivity is used. The mosaic construction requires masking the blocks 
of moving objects with black zones, as in figure 3. Given two consecutive frames, the 
mosaic is constituted by the pixels of the second frame; where pixels are masked we go 
back to the first frame. If pixels are still masked, they are supplied by the old 
mosaic, and so on. This method keeps the information up to date: as time passes, new 
zones are included in the background. 
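
The fallback rule for filling the mosaic can be sketched as follows. The sketch assumes that the two frames and the old mosaic have already been aligned to the same co-ordinates and that masked pixels are marked with `None`; the marker and the function name `update_mosaic` are hypothetical.

```python
MASKED = None  # assumption: marker for pixels covered by a moving object

def update_mosaic(old_mosaic, first, second):
    """Per pixel, prefer the second frame, then the first, then the old mosaic."""
    rows, cols = len(second), len(second[0])
    mosaic = []
    for r in range(rows):
        row = []
        for c in range(cols):
            for source in (second, first, old_mosaic):
                value = source[r][c]
                if value is not MASKED:
                    break
            # If every source is masked, the pixel stays masked for now.
            row.append(value)
        mosaic.append(row)
    return mosaic

old = [[1, 1, 1]]
first = [[MASKED, 5, 5]]
second = [[MASKED, MASKED, 9]]
result = update_mosaic(old, first, second)
```

Because the newest unmasked value always wins, zones uncovered later in the sequence replace older content, which is the up-to-date behaviour described in the text.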



4 Linguistic Semantic Interface 

The analysis of image sequences passes through a linguistic interface that makes it 
possible to express the semantics of the videos in a simple manner. This semantic 
linguistic interface makes it possible to analyse a reference film and to compare it 
with other films in order to find those with similar semantics. In practice a given 
film may constitute a video query on a film database. The possibility of referring all 
objects in a scene to a background makes it possible to describe the scenes in terms of 
movement parameters and of spatial relations. 

The extractor of video descriptions. We can now extract a series of features 
concerning the background and the moving shapes. These features are converted into a 
database of facts. The video is analysed by extractors that extract some 
characteristics and describe them by means of facts in a database. Figure 4 shows the 
schema of an extractor. Prolog facts represent features of the video, and in general 
there are several facts for the same detail. The set of all facts represents the 
syntactic description of the video. The facts are added to the database concerning all 
analysed videos. Some procedures including predicates specify the semantics of the 
relationships among the elements derived by the extractor. Particular instances 
obtained by the inference engine of Prolog constitute the description of the video. 

The choice of Prolog as the description language is motivated by the ease of defining 
new predicates and by its commercial availability. In addition, the inference engine of 
the language makes it possible to interpret the descriptions. Finally, the possibility 
for the user to define new predicates and new rules makes it possible to introduce 
personal primitives and interpretations, and to obtain the right description 
granularity. The Prolog knowledge base so built can be tested by submitting proper 
queries, which in our case take the form of a 'goal'. All the unifications that satisfy 
the posed goal are the descriptions, that is, the result of the navigation in the 
knowledge base. The movement can be checked by identifying the displacement of the 
camera and by separating the shapes with autonomous movement from the background. This 
separation makes it possible to show characteristics of the detected shapes. For 
instance, we can examine the path by computing the mass-centre frame by frame and 
referring it to the co-ordinates of the frame. But if we know the movement of each 
frame with respect to the next one, we can refer the mass-centre to the panoramic 
background. For each piece of video the fact descriptor produces a fact of the type: 

video(Number, Name, Frames). 

where Number is an integer that identifies the video in the database, Name is the name 
given to the video and Frames is the number of frames that constitute the piece. For 
each frame a descriptor produces facts of the type: 

mass_centre(Video, Frame, Shape, X, Y). 

where Video stands for the video under analysis, Frame for the number of the current 
frame and Shape for the shape being described, while X and Y are the co-ordinates of 
the reference point in the frame. 

Fig. 4. Extractor of video descriptions 

Another predicate describes the transformation from one frame to the next and 
reconstructs the camera movement: 

affine(A, B, C, D, E, F, Frame, Video). 

where the parameters A to F make it possible to reconstruct Frame from the preceding 
one. Predicates that describe the object movement frame by frame may be the following: 

% Image rows grow downwards, so a move towards south increases Y.
fsouth(V,F,S,X,Y,F1,X1,Y1) :- mass_centre(V,F,S,X,Y),
                              mass_centre(V,F1,S,X1,Y1),
                              Y<Y1, F<F1.

fnorth(V,F,S,X,Y,F1,X1,Y1) :- mass_centre(V,F,S,X,Y),
                              mass_centre(V,F1,S,X1,Y1),
                              Y>Y1, F<F1.
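
To make the meaning of such a rule concrete, the same test can be sketched in Python over a handful of facts. The sketch assumes that image rows grow downwards, so that a move towards south increases Y; the tuple layout and the name `fsouth` simply mirror the Prolog text, and the listing is an illustration rather than part of the described system.

```python
# (video, frame, shape, x, y) tuples mirroring the mass-centre facts.
MASS_CENTRE = [
    (1, 1, 1, 202.0, 211.0),
    (1, 2, 1, 203.0, 212.0),
    (1, 18, 1, 195.4, 238.3),
]

def fsouth(video, shape, facts=MASS_CENTRE):
    """Return the (f, f1) frame pairs, f < f1, in which the shape moved south."""
    pairs = []
    for v, f, s, _x, y in facts:
        for v1, f1, s1, _x1, y1 in facts:
            same_shape = (v, s) == (video, shape) == (v1, s1)
            if same_shape and f < f1 and y < y1:
                pairs.append((f, f1))
    return pairs

south_pairs = fsouth(1, 1)
```

Each returned pair plays the role of one answer that the Prolog engine would produce for the corresponding goal.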

The following predicates express the position of a shape with respect to the mosaic 
and the affine transformation between the mosaic and the frame under analysis: 

position(S,X,Y,T,V) :- mass_centre(V,T,S,X1,Y1),
                       video(V,_Name,Mosaic),
                       globalaffine(A,B,C,D,E,F,Mosaic,T,V),
                       X is integer(X1+A+B*X1+C*Y1),
                       Y is integer(Y1+D+E*X1+F*Y1).

globalaffine(A,B,C,D,E,F,End,End,_V) :- A is 0, B is 0, C is 0,
                                        D is 0, E is 0, F is 0.

globalaffine(A,B,C,D,E,F,End,Start,V) :- End\=Start,
                       T2 is End-1,
                       globalaffine(A1,B1,C1,D1,E1,F1,T2,Start,V),
                       affine(A2,B2,C2,D2,E2,F2,T2,V),
                       A is A1+A2, B is B1+B2, C is C1+C2,
                       D is D1+D2, E is E1+E2, F is F1+F2.
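
The accumulation performed by the recursive rule can be followed more easily in an iterative form. The Python sketch below mirrors the rule as printed: it sums the per-frame parameters from the start frame up to the frame before the end, and then maps a point into mosaic co-ordinates. The parameter values are hypothetical, and rounding with `round` is an assumption about how the Prolog `integer` function behaves.

```python
# Hypothetical per-frame facts: AFFINE[(video, frame)] = (A, B, C, D, E, F).
AFFINE = {
    (1, 1): (0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    (1, 2): (-2.0, 0.0, 0.0, 1.0, 0.0, 0.0),
    (1, 3): (-3.0, 0.0, 0.0, 1.0, 0.0, 0.0),
}

def global_affine(end, start, video):
    """Sum the per-frame parameters for frames start .. end-1.

    When end == start nothing is added and all six parameters are zero,
    which corresponds to the base case of the recursive rule."""
    total = [0.0] * 6
    for frame in range(start, end):
        total = [t + p for t, p in zip(total, AFFINE[(video, frame)])]
    return tuple(total)

def position(x1, y1, frame, video, mosaic_frame):
    """Map a mass-centre from frame co-ordinates to mosaic co-ordinates."""
    a, b, c, d, e, f = global_affine(mosaic_frame, frame, video)
    x = round(x1 + a + b * x1 + c * y1)
    y = round(y1 + d + e * x1 + f * y1)
    return x, y

mosaic_xy = position(100.0, 50.0, frame=2, video=1, mosaic_frame=4)
```

Summing the parameters is the composition used in the rule above; it is exact for pure translations, as in the toy data, and only approximate when the linear terms are not small.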

We can now write predicates to describe the movement inside the panoramic image. 
south(V,F,S,X,Y,F1,X1,Y1) :- position(S,X,Y,F,V),
                             position(S,X1,Y1,F1,V),
                             Y<Y1, F<F1.

The temporal relations of movement may be described by predicates like come and go: 

come(Obj,Obj1,F,F1,V) :- distance(Obj,Obj1,D,F,V),
                         distance(Obj,Obj1,D1,F1,V),
                         F<F1, D>D1.

go(Obj,Obj1,F,F1,V) :- distance(Obj,Obj1,D,F,V),
                       distance(Obj,Obj1,D1,F1,V),
                       F<F1, D<D1.
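
The text does not define the distance predicate, so the following Python sketch assumes the Euclidean distance between the mass-centres of two shapes in the same frame. The facts and function names are hypothetical and serve only to show how come and go compare two instants.

```python
import math

# Hypothetical (video, frame, shape, x, y) facts: shape 1 approaches shape 2.
FACTS = [
    (1, 1, 1, 0.0, 0.0),
    (1, 1, 2, 10.0, 0.0),
    (1, 2, 1, 3.0, 0.0),
    (1, 2, 2, 10.0, 0.0),
]

def distance(facts, video, frame, obj, obj1):
    """Euclidean distance between two mass-centres in one frame (assumption)."""
    pos = {(v, f, s): (x, y) for v, f, s, x, y in facts}
    x, y = pos[(video, frame, obj)]
    x1, y1 = pos[(video, frame, obj1)]
    return math.hypot(x - x1, y - y1)

def come(facts, video, obj, obj1, f, f1):
    """True if the two objects are closer at the later frame f1 than at f."""
    return f < f1 and (distance(facts, video, f, obj, obj1)
                       > distance(facts, video, f1, obj, obj1))

def go(facts, video, obj, obj1, f, f1):
    """True if the two objects are farther apart at the later frame f1."""
    return f < f1 and (distance(facts, video, f, obj, obj1)
                       < distance(facts, video, f1, obj, obj1))
```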

Other temporal relations are those between events, like before, after and 
simultaneous, which are described by order relations of the type F>F1, F<F1, F=F1. 

The predicate path describes in detail, using the previous facts, the route of the 
objects. 

The camera movement is described by predicates such as pan and tilt. 

Results from queries. The video of the example analysed above, named 'car', is 
composed of 56 frames of 300x200 pixels, corresponding to 2 sec. of vision, sampled 
every 3 frames. It shows a car that runs on a road, moving from top to bottom and from 
left to right. The camera has a similar movement. In the mass-centre path the numbers 
represent the frame numbers and give a temporal reference. The path contains a loop 
that is absent in reality. Figure 5 instead shows the same path placed on the mosaic. 




Fig. 5. Movement of the car with respect to the background. 

The path is more understandable: it proceeds from north to south and from east to 
west. As we see, the different paths depend on the different points of view. The facts 
automatically extracted by the system are of the type: 

video(1,car,18).
mass_centre(1,1,1,202,211).
mass_centre(1,2,1,203,212).
mass_centre(1,18,1,195.4,238.3).
affine(-1.961,-0.0007111,0.0002773,0.8482,-0.0007373,-0.0003905,18,1).
affine(-2.901,-0.005026,-0.004864,-9.16,-0.01003,-0.0008144,17,1).
affine(-2.388,0.003019,-0.0002567,5.572,0.001408,0.002383,2,1).
affine(0,0,0,0,0,0,1,1).

If we pose the goal: 

fsouth(1,1,1,_,_,Frame,_,_). 

we obtain the answers Frame = 2 and Frame = 4, which mean that in the frame intervals 
1-2 and 1-4 the car moves towards south with respect to the screen border. The result 
is different when we pose the query: 

south(1,1,1,_,_,Frame,_,_). 

The definition of the predicate move is strictly related to the semantics, and the 
predicate may be submitted to any database, that is, to any video. The possible 
instances of the variable Movement, in turn, give the description of those videos that 
satisfy the predicate move. We can build predicates for searching through the 
database, for instance with the goal: 

move(Object,[south],Start,End,Video). 

Dynamic Video Query. The semantic approach embedded in the Prolog predicates may of 
course be defined by consulting an expert, for instance an art director. The art 
director may analyse a film and choose it as a typical one, that is, a film containing 
some given technical properties, such as pan, point-of-view movement, etc. The 
technical content may be analysed automatically by the semantic extractor defined 
above. If we have a video base in which the sets of Prolog predicates and facts are 
condensed into a semantic index, that is, a video base indexed by means of the semantic 
descriptions of the videos, we can use the chosen video as a dynamic query. This query 
automatically extracts from the video base all the videos with similar semantics. 
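
The paper does not say how the similarity between two semantic descriptions is measured. As a purely hypothetical illustration, each video can be summarised by the set of movement labels that its predicates satisfy, and the prototype can be compared with every entry of the video base through the Jaccard overlap of these sets. The labels, the threshold and the function names below are all assumptions.

```python
def jaccard(a, b):
    """Overlap of two label sets: |a & b| / |a | b| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dynamic_query(prototype, video_base, threshold=0.5):
    """Return the names of the videos whose description is close to the
    prototype's, most similar first."""
    scored = [(jaccard(prototype, labels), name)
              for name, labels in video_base.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for score, name in scored if score >= threshold]

# Hypothetical semantic index: video name -> labels derived from its facts.
VIDEO_BASE = {
    "car": {"pan", "south", "west"},
    "train": {"pan", "east"},
    "still": {"tilt"},
}

matches = dynamic_query({"pan", "south"}, VIDEO_BASE)
```

Lowering the threshold widens the answer set, in the same way that a less specific goal unifies with more facts.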



5 Conclusion 

The paper describes a method for analysing a sequence of images by means of a 
background on which shapes move. The approach reduces the complexity of video analysis 
and makes it possible to capture the moving objects with their shape and their dynamic 
features. The analysis is performed in two directions: 1) detection of spatio-temporal 
relationships among objects and between objects and background; 2) querying of the 
knowledge base to discover the video content. The means is a robust extraction of a 
mosaic from a series of images. Then the movement of the objects, independently of the 
background, and the camera movement are analysed. From the background and the moving 
objects it is possible to reconstruct the original image sequence as well as to 
generate new ones. A linguistic interface makes it possible to express the semantics 
of the videos in a simple manner. Mosaicing and movement analysis make it possible to 
extract all the Video Shapes (VS) contained in a film. The description of the VSs and 
of their movements is produced by automatically generating a Prolog program: the facts 
of the program describe the syntax of the videos, while the predicates describe their 
semantics. A reference film may be analysed and compared with other films to find 
those with similar semantics. The linguistic interface, based on a fact database that 
is automatically built by the system, constitutes an interesting research sector. In 
particular, future work has to focus on fact interpretation, on 'intelligent' video 
analysis, as well as on video indexing and retrieval by dynamic properties of shapes 
and on automatic knowledge extraction from unknown videos. 



References 

1. F. Ackermann, M. Hahn: Image Pyramids for Digital Photogrammetry - Digital 
Photogrammetric Systems - Wichmann Ed., 43-57, Stuttgart. 

2. Anil K. Jain: Fundamentals of Digital Image Processing - Chap. 9, Image Analysis and 
Computer Vision - Prentice Hall Information and System Sciences Series. 

3. R. Bergen, P.J. Burt: A Three-Frame Algorithm for Estimating Two-Component Image 
Motion - IEEE Transactions on Pattern Analysis and Machine Intelligence, V14 N9, 
886-896 (1992). 

4. M. De Marsico, L. Cinque, S. Levialdi: Indexing pictorial documents by their content: a 
survey of current techniques - Image and Vision Computing, V15, 119-141 (1997). 

5. P.R. Giaccone, G.A. Jones: Segmentation of Global Motion using Temporal Probabilistic 
Classification - BMVC '98. 

6. E. Gülch: Automatic Extraction of Geometric Features from Digital Imagery - Digital 
Photogrammetric Systems - Ebner, Fritsch, Heipke Ed., Wichmann. 

7. Harpreet S. Sawhney, Serge Ayer: Compact Representations of Video Through Dominant 
and Multiple Motion Estimation - IEEE Transactions on Pattern Analysis and Machine 
Intelligence, V18 N8 (1996). 

8. B.K.P. Horn, B.G. Schunck: Determining Optical Flow - Artificial Intelligence, V17, 
185-203 (1981). 

9. M. Irani, P. Anandan, J. Bergen, R. Kumar, S. Hsu: Efficient Representations of Video 
Sequences and Their Applications - Signal Processing: Image Communication, V8 N4 
(1996). 

10. M. Irani, P. Anandan: Video Indexing Based on Mosaic Representations - Proceedings of 
the IEEE, May 1998. 

11. M. Irani, Benny Rousso, Shmuel Peleg: Computing Occluding and Transparent Motions - 
International Journal of Computer Vision, V12, 5-16 (1994). 

12. Ramesh Jain, Rangachar Kasturi, Brian G. Schunck: Machine Vision - Chap. 14, Dynamic 
Vision - McGraw-Hill Series in Computer Science. 

13. Rob Koenen: MPEG-4 Overview (Melbourne Version) - International Organisation for 
Standardisation, ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio, 
N2995 - October 1999. 

14. Rob Koenen: MPEG-7: Context, Objectives and Technical Roadmap, V.12 (Vancouver 
Version) - International Organisation for Standardisation, ISO/IEC JTC1/SC29/WG11, 
Coding of Moving Pictures and Audio, N2861 - July 1999. 

15. Emile Sahouria: Video Indexing Based on Object Motion - May 1997 - WWW. 

16. J. Shi, C. Tomasi: Good Features to Track - IEEE Conference on Computer Vision and 
Pattern Recognition, Seattle, June 1994. 

17. S.M. Smith: Reviews of Optical Flow, Motion Segmentation, Edge Finding and Corner 
Finding - Technical Report TR97SMS1, Oxford University (1997). 

18. T. Tommasini, A. Fusiello, V. Roberto: Robust Feature Tracking - Dipartimento di 
Matematica e Informatica, Università di Udine. 

19. R.C. Gonzalez, R.E. Woods: Digital Image Processing - Addison-Wesley Publishing 
Company. 







20. Newman, Sproull: Principles of Interactive Computer Graphics, 2nd Edition - Computer 
Science Series. 

21. J.Y.A. Wang, E.H. Adelson: Representing Moving Images with Layers - IEEE Transactions 
on Image Processing, Special Issue: Image Sequence Compression, V3 N5, 625-638 
(1994). 

22. Han Wang, Michael Brady: Real-time corner detection algorithm for motion estimation - 
Image and Vision Computing, V13 N9, 695-703 (1995). 




Probabilistic Hypothesis Generation for Rapid 
3D Object Recognition 



June-Ho Yi 



School of Electrical and Computer Engineering 
Sungkyunkwan University 
Suwon 440-746, Korea 
email: jhyi@ece.skku.ac.kr 



Abstract. A major concern in practical vision systems is how to 
retrieve the best matched models without exploring all possible object 
matches. This research presents probabilistic hypothesis generation 
based on an indexing approach for the rapid recognition of 
three-dimensional objects. The discriminatory power of a feature 
for a model object is defined in terms of a posteriori probability. 
This measure expresses the belief that a model appears in the scene after 
a feature is observed. We compute off-line the discriminatory power 
of features for model objects from CAD model data using computer 
graphic techniques. In order to speed up the indexing or selection 
of correct objects, we generate and verify the object hypotheses for 
features detected in a scene in the order of the discriminatory power of 
these features for model objects. Experimental results on synthetic and 
real range images show the effectiveness of our probabilistic method for 
hypothesis generation. 

Keywords: 3D, object recognition, probabilistic, indexing 



1 Introduction 

The fundamental issue in model-based recognition is how to rapidly narrow down 
the number of candidate models without actually searching through all the 
models. This problem has motivated the use of indexing or hashing for efficient 
retrieval of correct model objects. In indexing, the feature correspondence 
and the search of the model database are replaced by a table look-up mechanism, 
and the indexing table is computed off-line [1-2]. Recently, there has been some 
research based on probabilistic indexing [3-4], where not only the correspondence 
hypotheses but also the probability of each one being a correct interpretation is 
provided. 

Wheeler and Ikeuchi [5] compiled statistical information about image features 
and object features off-line from a large set of ray-traced images of each object. 
They represented the likelihood of hypotheses and their inter-dependencies using 
an MRF (Markov random field) to select a set of hypotheses with strong supporting 
evidence. Their system only considers polyhedral models and does not handle 
situations where only one surface is visible, while our system can handle 
objects with curved surfaces and single surface view situations. Beis and Lowe 
also employed a probabilistic approach. They used 4-straight-line-segment chains 
(three angles and the ratio of the interior edge lengths) as indexing vectors and 
trained an indexing function (a linear combination of Gaussians centered on the 
indexing vectors) from synthetic images taken from various viewpoints. Their 
indexing vectors cannot handle objects with curved edges. From the computed 
indexing function, they obtain the probability of each hypothesis being a 
correct interpretation of the data. The performance of these systems cannot be 
compared directly because they have been developed based on different assumptions 
and they operate in different scenarios using different features to generate object 
hypotheses. 

We have employed a formal probabilistic solution for efficient indexing of 
correct model objects using a Bayesian framework. We define a decision-theoretic 
measure of the discriminatory power of a feature for a model object in terms of a 
posteriori probability. We estimate this measure off-line using computer graphic 
techniques. This measure allows us to employ salient features of model objects 
first for object recognition. In our system design, a measure of how well a feature 
can be detected, called "the detectability of a feature", is defined as a function of 
the feature itself, the viewpoint, sensor characteristics, and the feature detection 
algorithm. The detectability of a feature is incorporated into the formulation of 
the discriminatory power of a feature for a model object by considering model 
dependent information and sensing dependent information separately based on 
their conditional independence. In order to speed up the indexing or selection of 
the correct objects, we generate and verify the object hypotheses for the features 
detected in the scene, in the order of the discriminatory power of these features 
for model objects. By considering the object hypotheses in this order, we verify 
only a few correct hypotheses of the scene objects, resulting in the acceleration 
of recognition. 

The following section gives a brief overview of our vision system. In section 3 
we define a decision-theoretic measure of the discriminatory power of a feature 
for a model object. In section 4 we describe how object features for indexing are 
automatically compiled using our example feature, LSG (Local Surface Group). 
Section 5 presents experimental results on the effectiveness of our probabilistic 
indexing scheme. 



2 System Overview 

Let us briefly overview the entire object recognition system proposed. The system 
is divided into two parts. One is off-line compilation of model information and 
the other is on-line recognition. 

The first component is concerned with the automatic computation of object 
representations from a CAD model database for the recognition of model objects. 
The second component is the range image simulator. One module of this component 
simulates the sensing process to estimate the detectability of features. 
The other module renders each model object for a sampling of viewpoints. From the 
rendered images, knowledge of all objects in the domain of interest is compiled. 
After the renderings are done for all model objects, the third component computes 
the a posteriori probability that a model object appears in a scene given a detected 
feature, i.e., the discriminatory power of a feature for a model object. As a result, 
a feature indexing table is constructed where features are linked to the models 
with a posteriori probabilities. This indexing table is loaded at recognition time. 

The on-line process consists of feature extraction, matching, and verification 
modules. The input to the feature extraction process is a dense range (depth) 
map from a single viewpoint. The feature extraction module detects features 
for generating object hypotheses. During the matching phase, features extracted 
from the scene are indexed by means of the precomputed indexing tables. A set 
of hypotheses is created and ordered in decreasing order of the probabilities 
associated with them. We validate these hypotheses by applying a series of filters 
using geometric constraints. Finally, we obtain a list of valid hypotheses that will 
enter the verification stage. At the verification stage, the valid hypotheses are 
verified in the order they appear in the list (i.e., in the order of their probability). 
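
The on-line look-up and ordering step can be sketched as follows. The table layout (feature label mapped to a dictionary of model names and a posteriori probabilities), the labels and the function name `generate_hypotheses` are hypothetical; the sketch only illustrates how hypotheses are ranked before verification and does not include the geometric filters.

```python
def generate_hypotheses(scene_features, index_table):
    """Look up each detected feature and return (probability, feature, model)
    hypotheses sorted by decreasing a posteriori probability."""
    hypotheses = []
    for feature in scene_features:
        # Features that are absent from the table generate no hypothesis.
        for model, prob in index_table.get(feature, {}).items():
            hypotheses.append((prob, feature, model))
    hypotheses.sort(key=lambda h: -h[0])
    return hypotheses

# Hypothetical precomputed table: feature -> {model: P(model / feature)}.
INDEX_TABLE = {
    "lsg_a": {"bracket": 0.9, "gear": 0.1},
    "lsg_b": {"gear": 0.6, "bracket": 0.4},
}

ranked = generate_hypotheses(["lsg_b", "lsg_a", "lsg_z"], INDEX_TABLE)
```

Verification would then walk through `ranked` from the first entry and stop as soon as a hypothesis is confirmed, which is where the saving in recognition time comes from.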




Fig. 1. System Overview (OFF-LINE and ON-LINE parts) 



3 Discriminatory Power of a Feature for a Model Object 

We exploit the discriminatory power of a feature for a particular model object 
for efficient indexing of the best matched models. In order to define the 
discriminatory power of a feature for a model object in terms of a posteriori 
probability, we start with the joint probability $P(m_k, M_i, \mathit{viewpoint}_j)$, 
$k = 1, \ldots, l$, $i = 1, \ldots, N$, and $j = 1, \ldots, v$, where $m_k$, $M_i$, and 
$\mathit{viewpoint}_j$ denote a feature for indexing, the i-th model object, and the 
j-th viewpoint of a set of viewpoint samplings, respectively. $l$, $N$, and $v$ 
represent the numbers of features, models, and viewpoints, respectively. This joint 
probability encodes the information conveyed by a feature of a model object. The same 
feature may occur in several different models. If a feature to be used is viewpoint 
independent, we can ignore $\mathit{viewpoint}_j$ in $P(m_k, M_i, \mathit{viewpoint}_j)$. 

3.1 Definitions and Notations 

$P(m_k, M_i, \mathit{viewpoint}_j)$ : joint probability of $m_k$ (a feature for 
indexing), $M_i$ (i-th model object), and $\mathit{viewpoint}_j$ (j-th viewpoint of a 
set of viewpoint samplings), where $k = 1, \ldots, l$, $i = 1, \ldots, N$, and 
$j = 1, \ldots, v$. 

$P(M_i)$ : The probability that a given object in a scene is $M_i$. Then we have 
$\sum_{i=1}^{N} P(M_i) = 1$. 

$P(m_k / M_i)$ : a likelihood function, that is, $P(m_k / M_i) > P(m_k / M_j)$ means 
that the model $M_i$ is more "likely" to be the model object that the feature $m_k$ 
belongs to than the model $M_j$, in that $m_k$ would be a more plausible instance of 
the features of the model $M_i$ than the model $M_j$. 

$P(M_i / m_k)$ : This a posteriori probability reflects the updated belief that model 
$M_i$ appears in the scene after the feature $m_k$ is observed. 

$D_{m_k}$ : Detectability of a feature $m_k$. It measures how well a feature $m_k$ can 
be detected. 

Definition : 

$P(M_i / m_k)$ is the discriminatory power of the feature $m_k$ for a particular model 
object $M_i$. 
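
The quantities just defined are linked by Bayes' rule: the a posteriori probability of a model is its prior times the likelihood of the feature, normalised over all models. The short Python sketch below shows only this textbook relation; the model names and the numbers are hypothetical, and the way the paper folds the detectability into the likelihood is not reproduced here.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a set of models.

    priors[m]      : P(M = m)
    likelihoods[m] : P(feature / M = m)
    Returns P(M = m / feature) for every model m."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    total = sum(joint.values())
    if total == 0:
        # The feature never occurs in any model: no model is supported.
        return {m: 0.0 for m in joint}
    return {m: value / total for m, value in joint.items()}

# Hypothetical two-model domain with equal priors.
power = posterior({"A": 0.5, "B": 0.5}, {"A": 0.3, "B": 0.1})
```

With equal priors the feature is three times more likely to come from model A, so its discriminatory power for A is 0.75 and for B is 0.25.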

The detectability of a feature is considered in the computation of the 
discriminatory power of a feature for a model object. The detectability of a feature, 
$D_{m_k}$, depends on the feature $m_k$ itself (i.e. feature class). For example, a 
vertex feature may be less reliably detectable than a surface feature. $D_{m_k}$ also 
changes as the viewpoint varies. For example, a planar surface is more difficult to 
detect from a viewpoint in which it appears steeply sloped than from a viewpoint in 
which it appears flat. The sensor's capability is also important for a feature to be 
reliably detectable. Finally, the detectability of a feature can vary according to the 
feature detection algorithm used. Therefore, we represent the detectability of a 
feature $m_k$, $D_{m_k}$, as a function of the above four factors: 

$$0 \le D_{m_k} = f(m_k, \text{viewpoint}, \text{sensor}, \text{feature detection algorithm}) \le 1 \qquad (1)$$

3.2 Computation of Discriminatory Power 

In the following, we describe how to estimate the a posteriori probability
$P(M_i / m_k)$. Let us denote estimates of quantities defined in the previous section
by a hat above the symbol. $\hat{P}(M_i)$ and $\hat{P}(m_k, \text{viewpoint}_j / M_i)$ can be calculated
once we know the specific application domain and determine which feature to
use. $\hat{P}(M_i)$ can be computed by observing the frequency of the appearance of
the model object $M_i$ in the scene and normalizing it by the total number of
observations of all models. Once the decision to use feature $m_k$ for the indexing
of model objects is made, we compute $\hat{P}(m_k, \text{viewpoint}_j / M_i)$ by counting the
number of appearances of the feature $m_k$ in model object $M_i$ for viewpoint $j$.
However, the feature is often not perfectly detectable, i.e., $D_{m_k}$ is not 1.0.
To incorporate feature detectability into the computation of the discriminatory
power, we consider model dependent information and sensing dependent
information separately. That is, we estimate the model dependent information,
$\hat{P}(m_k, \text{viewpoint}_j / M_i)$, assuming perfect detectability of the feature, and
incorporate the sensing dependent information, $D_{m_k}$, by multiplying the two terms as
follows:

$$\hat{P}(m_k, \text{viewpoint}_j / M_i) \cdot D_{m_k} = \left( \frac{\#\ \text{occurrences of the feature } m_k \text{ in } M_i \text{ for viewpoint}_j}{\#\ \text{occurrences of all features } m_1, \ldots, m_l \text{ in } M_i \text{ for all viewpoints}} \right) D_{m_k} \qquad (2)$$

This way, feature detectability can be incorporated into the computation of the
discriminatory power when CAD model data is used. Therefore, the likelihood
$\hat{P}(m_k / M_i)$ and the a posteriori probability $\hat{P}(M_i / m_k)$ are computed as

$$\hat{P}(m_k / M_i) = \sum_{j=1}^{v} \hat{P}(m_k, \text{viewpoint}_j / M_i) \cdot D_{m_k} \qquad (3)$$

and

$$\hat{P}(M_i / m_k) = \frac{\hat{P}(M_i)\, \hat{P}(m_k, \text{viewpoint}_j / M_i) \cdot D_{m_k}}{\sum_{h=1}^{N} \hat{P}(M_h)\, \hat{P}(m_k, \text{viewpoint}_j / M_h) \cdot D_{m_k}} \qquad (4)$$

Note that if a feature $m_k$ does not exist in the model object $M_i$, $\hat{P}(m_k / M_i) = 0.0$.
As previously stated, the same formulation can be applied to
viewpoint independent features by ignoring the viewpoint $j$ term.

Given a particular feature and viewpoint, estimating $D_{m_k}$ amounts to determining
how different feature detection algorithms behave under different sensor
characteristics (for example, signal/noise ratio). For the case of edge detection,
in which the feature is an edge, the Sobel operator is known to perform better in
noisy conditions than the Roberts cross. Therefore $D_{m_k}$ for edge features would
be higher for the Sobel operator than for the Roberts cross. In fact, it is possible
to analytically determine the probability of detecting an edge using either
algorithm with a given signal/noise ratio. In the current prototype system, $D_{m_k}$
is assumed to be a constant.
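The estimation described in this section can be illustrated with a short sketch. It is not the author's implementation: the nested-dictionary layout of `counts`, the function name, and the choice to apply Bayes' rule to the viewpoint-summed likelihood of Equation (3) are all assumptions made for illustration, and the detectability is taken as a single constant, as in the prototype system.

```python
from collections import defaultdict


def discriminatory_power(counts, priors, detectability=1.0):
    """Return {feature: {model: P(M_i / m_k)}} from feature occurrence counts.

    counts[model][viewpoint][feature] -> occurrences (hypothetical layout)
    priors[model]                     -> P(M_i)
    detectability                     -> D_mk, assumed constant for all features
    """
    # likelihood[feature][model] = P(m_k / M_i), summed over viewpoints (Eq. 3)
    likelihood = defaultdict(dict)
    for model, per_view in counts.items():
        total = sum(sum(feats.values()) for feats in per_view.values())
        if total == 0:
            continue
        summed = defaultdict(int)
        for feats in per_view.values():
            for feature, n in feats.items():
                summed[feature] += n
        for feature, n in summed.items():
            likelihood[feature][model] = detectability * n / total

    # Bayes' rule: normalise prior * likelihood over all models for each feature.
    posterior = {}
    for feature, per_model in likelihood.items():
        evidence = sum(priors[m] * p for m, p in per_model.items())
        posterior[feature] = {m: priors[m] * p / evidence
                              for m, p in per_model.items()}
    return posterior


counts = {
    "M1": {0: {"a": 2, "b": 2}},
    "M2": {0: {"a": 1, "c": 3}},
}
priors = {"M1": 0.5, "M2": 0.5}
power = discriminatory_power(counts, priors)
```

Because the detectability is constant here, it cancels in the posterior; a feature-dependent $D_{m_k}$ would have to be looked up per feature instead.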



4 Construction of Indexing Table 

In this section, we describe our example feature, the LSG, for object hypothesis
generation and present how to construct an indexing table.

4.1 Model Features for Object Hypothesis Generation



Our object recognition system can employ a wide class of features for object
hypothesis generation as long as the discriminatory power, $P(M_i / m_k)$, can be
computed. In the current prototype system, we use the LSG (local surface group).
An LSG is not a simple feature but a viewpoint dependent feature structure that
contains several attributes. Figure 2 shows an example of an LSG that consists of
a visible surface patch $C_1$ and its two adjacent surface patches, $P_1$ and $P_2$, that
are simultaneously visible for the given viewpoint. Once we know the adjacent
surfaces that are simultaneously visible, we access the node attribute set of the
attribute-relational graph corresponding to the model object and can extract
the information shown in the LSG. The most popular surface types used in
computer vision are quadric surfaces because the majority of man-made objects can
be modeled by them. Among the quadrics, planar, cylindrical (ridge and valley),
and spherical (peak and pit) surfaces are supported in the current prototype
system. The last entry of each surface patch in the attribute
<simultaneously-visible-adjacent-surfaces: <list of surfaces>> is the angle between the surface
orientation of the seed surface and the adjacent surface. This angle is applicable
only when both surface types are either planar or cylindrical (ridge, valley).
Surface orientation is defined for planar surfaces as the direction of the surface
normal and for cylindrical surfaces as the direction of the axis. For
pairs of other surface types, the angle value is set to NIL, which indicates an
undefined value. Note that the LSG is a viewpoint dependent feature structure
and that the number of LSGs for a viewpoint is theoretically at most the number
of visible surface patches for this given viewpoint. The LSG can be extended to
incorporate other feature attributes such as color and texture information.

Indexing involves a tradeoff between the complexity of the indexing feature
and the efficiency of indexing using the feature. If the indexing feature is complex
(for example, a whole object as an indexing feature), indexing will not be efficient
but will result in only a few candidate models. On the other hand, if a simple
feature is used for indexing (for example, a surface patch as the indexing feature),
indexing will be easy but many model objects will be indexed for a feature detected
in the scene. We use a subset of the LSG as an indexing feature because the
use of the complete LSG as an indexing feature makes the indexing of model
objects complex and computationally expensive. Choosing an indexing feature
of optimal complexity in the sense of recognition performance is a topic for
further work. We call our indexing feature “Indexing_LSG”. An example
of the Indexing_LSG that is used in our current system for the indexing of
model objects is shown in Figure 2. In an Indexing_LSG, only the surface type
information of simultaneously visible adjacent surface patches and the sum of the
angles listed in the LSG are encoded, without distinguishing the respective adjacent
surface patches. Different instances of this Indexing_LSG are the $m_k$’s of Section 3.






An example LSG (Local Surface Group)

< seed-surface-label: C1 >
< model-label: ... >
< surface-type: ridge (1) >
< radius: 0.75 >
< simultaneously-visible-adjacent-surfaces:
    ( P1: planar (0), area, radius (nil), angle )
    ( P2: planar (0), area, radius (nil), angle ) >

Indexing_LSG

< surface-type-of-seed-surface: ridge (1) >
< surface-types-of-simultaneously-visible-adjacent-surfaces: (planar (0), planar (0)) >
< sum-of-angles >



Fig. 2. An example of LSG 



4.2 Automatic Compilation of LSGs 

A model object is taken from our model database and rendered, along with surface
labels, using the z-buffer algorithm for a viewpoint sampling, and LSGs are compiled.
To obtain a viewpoint sampling, we use the dual of a geodesic polyhedron with
frequency $Q$ (default 4) of geodesic division based on the icosahedron. This dual
polyhedron generates $10Q^2 + 2$ (default 162) viewpoints on a unit sphere. The
process of compiling LSGs can be summarized as:

For i = 1, 2, ..., N
    For j = 1, 2, ..., v
        Render the range image of the object M_i for viewpoint j
        along with surface labels.
        Scan the range image to collect LSGs.
    end for j
end for i
Compute the P(M_i / m_k)'s and return the indexing table.
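The compilation loop above can be sketched in a few lines. This is only an outline under stated assumptions: `extract_features` is a hypothetical callable standing in for the rendering and range-image scanning steps, which are not reproduced here, and the function names are illustrative.

```python
from collections import Counter


def viewpoint_count(frequency=4):
    """Number of viewpoints, 10 * Q^2 + 2, for geodesic frequency Q."""
    return 10 * frequency ** 2 + 2


def compile_feature_counts(models, viewpoints, extract_features):
    """Count indexing features per model and per viewpoint.

    extract_features(model, viewpoint) is a hypothetical stand-in for
    rendering the labelled range image and scanning it to collect LSGs;
    it must return an iterable of hashable feature descriptors.
    """
    counts = {}
    for model in models:                    # i = 1, ..., N
        counts[model] = {}
        for viewpoint in viewpoints:        # j = 1, ..., v
            features = extract_features(model, viewpoint)
            counts[model][viewpoint] = Counter(features)
    return counts


def fake_extractor(model, viewpoint):
    # Toy data only; a real extractor would render and scan a range image.
    return ["a", "a", "b"] if viewpoint == 0 else ["b"]


table_counts = compile_feature_counts(["M1"], [0, 1], fake_extractor)
```

The resulting counts are the raw statistics from which the discriminatory power of each feature can then be estimated off-line.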

5 Experimental Results 

5.1 Extracting and Ordering Scene LSGs 

To compute LSGs from an input scene image, we segment the image first and 
characterize feature attributes of surface patches such as primitive surface type, 
area/radius of surface, surface normal direction, and so on. 

5.2 Probabilistic Hypothesis Generation 

We have tested our approach using the 20 object model database shown
in Figure 3. We have visualized in Figure 4 the distribution of $P(M_i / m_k)$. Let
us make several comments about the information displayed in Figure 4. If the
discriminatory power of a feature ($m_k$) for a model object ($M_j$) is 1.0 (i.e.
$P(M_j / m_k) = 1.0$), the feature $m_k$ is unique to the model object. In other words,
if the feature $m_k$ is detected in the scene, it is certain that the model $M_j$ is in
the scene. On the other hand, suppose that feature $m_2$ is detected in the scene.
Then,

$$P(M_3 / m_2) = 0.425 > P(M_9 / m_2) = 0.401 > P(M_5 / m_2) = 0.174$$

indicates the belief that the appearance of model objects in the scene is plausible in
the order of $M_3$, $M_9$ and $M_5$. Model object $M_3$ is hypothesized first, and then
$M_9$ and $M_5$. Similarly, when several features are detected in the scene, object
hypotheses are generated in the order of the discriminatory power, $P(M_i / m_k)$.
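As a small illustration of this ordering, the sketch below ranks hypotheses by discriminatory power using the three posterior values quoted above. The dictionary layout, the function name and the label strings are assumptions made for the example only.

```python
def ordered_hypotheses(detected_features, posterior):
    """Return (model, feature, power) triples, most discriminating first.

    posterior[feature][model] holds P(M_i / m_k) (hypothetical layout).
    """
    hypotheses = [
        (model, feature, power)
        for feature in detected_features
        for model, power in posterior.get(feature, {}).items()
    ]
    return sorted(hypotheses, key=lambda h: h[2], reverse=True)


posterior = {"m2": {"M5": 0.174, "M3": 0.425, "M9": 0.401}}
ranking = ordered_hypotheses(["m2"], posterior)
ranked_models = [model for model, _, _ in ranking]
```

With several detected features, all (model, feature) pairs are pooled and sorted together, so a highly discriminating feature promotes its model ahead of hypotheses raised by weaker features.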







Fig. 3. Model database 



5.3 Hypothesis Verification 

We verify the hypotheses listed in the valid-hypotheses, one by one, in the order 
in which they appear. In order to verify an object hypothesis, we first find what 
model surfaces, other than the initially matched model surfaces in the hypoth- 
esis, should appear in the scene image. We compute a candidate view of the 
hypothesized model object from the initially matched pairs of scene and model 
surfaces in the object hypothesis. We render the hypothesized model object for 
the computed view and compute a list of neighboring surface pairs. Then the 
verification routine checks whether the model surface pairs can be found in the 
list of scene surface pairs. Compatibility between a model surface and the cor- 
responding scene surface is determined based on the geometric constraints such 
as surface area, radius, and angle between two surfaces. 

5.4 Indexing Efficiency 

To experimentally determine the effectiveness of our indexing scheme, we define
a measure of capability to index correct objects for our technique. We name this
measure indexing-efficiency-measure and it is defined as:

Definition:

indexing-efficiency-measure = position of the successfully verified hypothesis in
the list of hypotheses initially generated.

Fig. 4. Distribution of the $P(M_i / m_k)$'s for the 20 object model database shown in Figure 3.
$P(M_i / m_k) = 0.0$ (black) and $P(M_i / m_k) = 1.0$ (white)

We have carried out experiments using a set of synthetic and real range images. A
tabular summary of the results is shown in Table 1. We have generated synthetic
range images of all objects in the model database for 10 randomly selected views
for each object (total of 200 experiments). For real range images, we have built
four objects, Mq, M3, M5 and M15. 13 range images of these objects for several 
different poses were scanned. The average value of indexing-efficiency-measure
was 2.80 and 2.68 for the synthetic and the real range images, respectively. That
is, correct hypotheses were located near the third position in the list of hypotheses.
This demonstrates the effectiveness of our indexing scheme for the current model
database, although we adopted a simplified version of an LSG as the Indexing_LSG.
The average number of hypothesis verifications leading to successful recognition
was 1.7 for the synthetic range images and 1.8 for the real range images,
because hypothesis validation using geometric constraints served as an
extra filter before each hypothesis entered the actual verification procedure.

6 Summary and Conclusions 

We have proposed a probabilistic method for efficient generation of object
hypotheses, based on an indexing approach. We achieve rapid recognition by generating
the object hypotheses for the features detected in the scene in the order of
the discriminatory power of these features for model objects. The discriminatory
power of an Indexing_LSG in favor of an object model is computed off-line by
compiling statistics from the rendered images of the model objects in the model
database.

Table 1. Experimental results

Number of images        Recognition accuracy    Indexing-efficiency-measure
200 synthetic images    89.0% (178/200)         2.80
13 real images          84.6% (11/13)           2.68

We experimentally demonstrated the effectiveness of our indexing scheme using a
feature structure called LSG (Local Surface Group) for generating the object
hypotheses. The novelty of our approach is in the use of a formal probabilistic
solution for efficient indexing of correct model objects, resulting in a speeding
up of recognition.

Abstract. In this paper we present an efficient method for smooth surface
generation from unorganised points using NURBS. This is a preferred
alternative to using triangular meshes, which are expensive to
store, transmit and render, and are difficult to manipulate. The proposed
method does not require triangulation prior to surface fitting because
it generates NURBS directly. Two fundamental problems must be addressed
to accomplish this task: parameterisation of measured data and
overcoming ill-conditioning of the least squares surface fitting. We propose
to solve the parameterisation problem by employing a suitable base
surface, automatically generated from the data points, or provided as a
CAD model if available. Ill-conditioning is solved by introducing additional
fitting criteria in the minimisation functional, which constrain
the fitted surface in the regions with an insufficient number of data points.
Surface fitting is performed by treating the surface as a whole without
the need to either identify or re-measure the regions with insufficient
data. The accuracy of fitting is dictated by the number of control points.
The improvements in data compression, shape analysis and rendering
are presented. The realised computational speed and the quality of the
results were found to be highly encouraging.



1 Introduction 

The problem of approximating a surface from a cloud of unordered 3D points 
appears in many areas including computer vision, computer aided design and 
object recognition. With the advancements in technology allowing fast digiti- 
sation of a large number of points on an object, there is a clear need for new 
methods that can handle large amounts of data in acceptable time and memory 
space. 

At the same time, in today’s industrial practice many products are designed
using free-form surfaces. The principal modelling entity is NURBS (Non-Uniform
Rational B-spline). Trimmed NURBS are also widely used, because they largely
overcome the limitations imposed by the strictly rectangular domain of tensor
product surfaces and provide additional flexibility. NURBS are especially
suitable for web applications as they allow data compression and are compatible
with OpenGL. They are also supported by a large number of CAD programs.
It is therefore natural to try to reconstruct measured objects using NURBS and
trimmed NURBS.

Terzopoulos and Qin propose dynamic NURBS for scattered data fitting,
based on a computational physics framework for shape modelling, in which the
fitted surface results from the numerical integration of a set of non-linear
equations. However, they report that in their implementation matrix assembly and
matrix vector multiplication quickly become too costly, so in practice they are
limited to using surfaces of the order of 10 by 10 control points. Also, the blended
deformable models of Snyder and the generative modelling of Ramamoorthi and
Arvo might be seen as tools for surface fitting. They use the measured
data cloud with a user defined class of models which are a generalisation of swept
surfaces. A shortcoming of this approach is that it is limited in representing local
detail.

The main work on surface updating was done by Ma, Kruth and He,
who present methods for least squares fitting of B-splines to unorganised points.
It is well known that methods based on least squares fitting have a potential
problem with rank deficient matrices, which is a direct result of an insufficient
coverage of certain regions by the measurements. The main idea proposed
by Ma and He is to avoid the singularity problem by excluding from fitting
those control points for which the position is not defined by the data. After
applying this solution it is still possible that the system is ill-conditioned due to
sparsity of data in some regions, so it was suggested that those regions should
be re-measured. However, this is not very practical for shape refinement since
fitting, knot insertion and measuring may need to be repeated a number of
times.

In this paper we present a novel method to generate or update the model 
based on a set of unorganised measured points in three-dimensional space. Our 
approach features the following: 

— Eliminates the need for pre-processing/triangulation and directly generates 
NURBS, 

— Starts with a CAD model if available, or automatically generates initial 
surface, 

— Solves the problem of ill-conditioning through regularisation, and 

— Exploits banded-matrix-based algorithms to gain in computational efficiency 

2 Overview of the Algorithm 

The algorithm is based on least squares fitting. To define a fitting problem the 
measurement data set, the degree of the curve or surface to be constructed and 
the error bound specification are required as input data. It is usually not known 
in advance how many control points are required to obtain the desired accuracy, 
hence approximation methods are generally iterative. 

Least squares fitting requires parameterisation of the input data. The assignment
of $u$, $v$ parameters is crucial as it has a strong effect on the shape of the
fitted surface. A number of methods to parameterise points have been published
but the majority of them make the assumption that the data is ordered. Since
our work aims to deal with both ordered and unordered data, an alternative
method was developed following the suggestion by Ma and Kruth [7], where the
parameterisation can be achieved by projecting the points onto a base surface,
from which the $u_k$ and $v_k$ values are obtained. The base surface might be seen as
the first approximation of the final surface. Consequently, there are three main
parts to this algorithm. The first is to initialise the fitting surface and the related
parameters, the second to fit the surface to the data points, and the third to
insert additional knots if necessary.

2.1 Base Surface 

There is no general solution to the problem of how to generate the base surface. 
Various authors suggest different strategies depending on the complexity of the 
object. In the approach adopted in this work a CAD model of the object is used 
as a base surface, if available. Otherwise, we devised an algorithm to generate the 
base surface automatically within our method, for the case of single-valued data 
and for the case where the object is of tubular shape. If data is single-valued, 
such as, for example, a single-view range image of an object, then a single- 
valued surface may be used to represent the model, in which case a rectangle 
is automatically generated. If, however, the surface is closed and of a tubular 
geometry a generalised cylinder is automatically generated. In both cases the 
initial number of control points may be defined by the user, with uniform knot 
vectors and weights set to 1. The flexibility of the surface can be further increased 
by knot refinement during the fitting procedure, where in each step a number 
of knots is inserted in areas with largest deviation until required accuracy is 
achieved. 



Planar Base Surface Generation. We compute the centre of mass and prin- 
cipal axes for the cloud of data. Measured points are then projected onto a plane 
defined by the centre of mass and least principal axis. A rectangle containing 
all projected points is constructed in that plane. Finally, we generate a NURBS 
surface as bilinear interpolation of the rectangle’s edges. 
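The planar base surface construction can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes that the "least principal axis" is the direction of smallest variance and serves as the plane normal, and it returns only the rectangle corners and a bilinear evaluation rather than a full NURBS object. The function names are illustrative.

```python
import numpy as np


def planar_base_rectangle(points):
    """Return four 3D corners of a rectangle enclosing the projected points.

    The plane passes through the centre of mass; its normal is assumed to be
    the principal axis of smallest variance.
    """
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)
    centred = points - centre
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    _, axes = np.linalg.eigh(centred.T @ centred)
    e_u, e_v = axes[:, 2], axes[:, 1]      # the two in-plane directions
    u, v = centred @ e_u, centred @ e_v    # plane coordinates of each point
    corners_uv = [(u.min(), v.min()), (u.max(), v.min()),
                  (u.min(), v.max()), (u.max(), v.max())]
    return np.array([centre + a * e_u + b * e_v for a, b in corners_uv])


def bilinear_point(corners, s, t):
    """Evaluate the bilinear patch spanned by corners [c00, c10, c01, c11]."""
    c00, c10, c01, c11 = corners
    return ((1 - s) * (1 - t) * c00 + s * (1 - t) * c10
            + (1 - s) * t * c01 + s * t * c11)


cloud = [(x, y, 7.0) for x in (1, 2, 3, 4, 5) for y in (2.0, 2.5, 3.0)]
corners = planar_base_rectangle(cloud)
middle = bilinear_point(corners, 0.5, 0.5)
```

A bilinear patch of this kind corresponds to a degree (1, 1) NURBS surface with unit weights, which could then be degree-elevated or refined before fitting.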



Generalised Cylinder Generation. Using effective shape control techniques
for generalised cylinders it is possible to model natural shapes such as trees, 
arms, legs and bodies. 

The description of objects with cylindrical shapes has been extensively used 
in the computer vision community. A generalised cylinder is obtained by sweep- 
ing a 2D cross-section along a trajectory space curve (called spine), in which the 
2D cross section may change its shape dynamically while moving along the tra- 
jectory curve. In contrast to voxelised data, we start with an unorganised cloud 
of points sampled on a surface. A 3D curve must be specified inside the cloud of 
points in order to use a cylindrical model and to represent the surface. Although 
essential, the issue of finding a suitable axis in order to design a generalised
cylinder has rarely been discussed. 

We have implemented the method suggested by Nazarian, Chedot and
Sequeira [8], where recursive subdivision of the set of data points is used. At each
step the subsets of data points are split by a plane perpendicular to their main
axes of inertia and passing through their barycentres. The resulting axis is a
polyline composed of the main axes of parts of the initial set of data points. The
authors recognise that their method does not always work. Moreover, in some 
cases, it is even impossible to find the spine automatically due to the shape and 
non-uniform density of measured points. For those cases, we have implemented 
a new method to interactively define the spine. The details of the interactive 
method are outside the scope of this paper. 

Using the spine we generate a constant radius cylinder as a swept surface,
which is then subject to least squares minimisation, the variables being the
positions of control points. We interpret the generalised cylinder as a smooth 
deformation of a cylinder. 



2.2 Iterative NURBS Fitting 

Least squares fitting of NURBS through a set of points would lead to a non- 
linear optimisation problem if the unknowns are the control points, parameters 
(u, v), knots, and the weights. However, the non-linear nature of the problem 
can be avoided and the optimisation can be greatly simplified, if the weights and 
the knot vectors are set a priori. We propose using the weights and knot vectors 
obtained from the initial model and then optimising only the positions of control 
points. 

By denoting the measured points as $Q_1, \ldots, Q_M$, we set up a linear least
squares problem for the unknown control points. The functional to be minimised
is:

$$f = \sum_{k=1}^{M} \left| Q_k - S(u_k, v_k) \right|^2 \qquad (1)$$

where $u_k$ and $v_k$ are the parametric co-ordinates corresponding to the closest
point to each measured point.

The new positions of the $N$ control points are obtained as a solution to the
system of normal equations

$$A^T A\, a = A^T b \qquad (2)$$

where $A$ is an $M \times N$ matrix of B-spline coefficients corresponding to the
closest points on the base surface for all measured points, $a$ is the vector of
control points and $b$ is the vector of measured points.

Normal equations can be solved using a number of methods. We have imple- 
mented the iterative method of Gauss-Seidel, as well as Cholesky decomposition. 
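For reference, a textbook Gauss-Seidel iteration for a small symmetric positive definite system is sketched below. It is a generic dense version written for illustration and does not reproduce the authors' implementation; the function name, tolerance and iteration limit are assumptions.

```python
def gauss_seidel(matrix, rhs, iterations=200, tol=1e-12):
    """Solve matrix * x = rhs iteratively.

    Convergence is guaranteed when the matrix is symmetric positive definite,
    which holds for well-posed normal equations.
    """
    n = len(rhs)
    x = [0.0] * n
    for _ in range(iterations):
        largest_change = 0.0
        for i in range(n):
            # Use the newest available values of the other unknowns.
            s = sum(matrix[i][j] * x[j] for j in range(n) if j != i)
            new_value = (rhs[i] - s) / matrix[i][i]
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            break
    return x


solution = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For three-dimensional control points the same iteration is run once per coordinate, since the matrix is shared by the x, y and z right-hand sides.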

Rank Deficiency and Ill-Conditioning of the Least Squares Problem.

The matrix $A^T A$ in the normal equations is prone to rank deficiency. As the
size of the system increases it is also likely that the set of equations becomes
ill-conditioned. The problem of rank deficiency arises from the localised nature
of the basis functions and its detailed presentation is provided by Dierckx [2].
The main reasons for this are:

— Incomplete data sets due to inaccessibility (most often close to the edges) 

— Incomplete data sets due to the use of trimmed NURBS in modelling 

— Knot insertion 




Fig. 1. Surface fitting and knot insertion, (a) Initial surface and unevenly distributed 
measurement cloud; (b) low flexibility fitted surface; (c) high flexibility fitted surface 



As an illustration, Fig. 1 presents an example in which a base surface is
updated using unevenly distributed measurements (Fig. 1a). The results of surface
fitting (Fig. 1b) show that the available number of control points provides
insufficient flexibility for the surface. This is a clear case for employing knot
insertion, which provides additional flexibility (Fig. 1c). Consequently, fitting is
significantly improved in the central area, but this leads to a situation where
empty knot segments start to appear in the sparsely measured outer regions,
causing ill-conditioning of the matrix $A^T A$ and corruption of the shape.

Proposed Solution: Regularisation. If the problem is ill-posed there is no 
way to overcome this unless additional a priori information about the solution 
is available. The majority of authors use smoothness as such additional crite- 
rion to restore stability and construct efficient numerical algorithms, but this 
choice proves costly in terms of computational speed and memory space. We 
have adopted a different, novel approach introducing new constraints, which 
proved computationally highly efficient. 

In developing the solution for the regularisation problem, it was noted that 
when the system becomes unstable, the control points associated with the areas 
with insufficient data move in an uncontrollable fashion, away from the surface. 
Our principal concept is based on the fact that control points do approximate 
the surface and it seems natural to keep them as close to the surface as possible. 
This is achieved by introducing an additional criterion (“$\alpha$-criterion”), which is
to minimise the sum of the squared distances between the control points and
their corresponding points on the fitted surface. We expected this criterion to
smooth the surface and to provide an equivalent of energy minimisation.

Mathematically, this is reflected in expanding the functional of Equation (1)
to include an additional “$\alpha$-criterion” term, as follows:



$$f = \sum_{k=1}^{M} \left| Q_k - S(u_k, v_k) \right|^2 + \alpha \sum_{i=0}^{n} \sum_{j=0}^{m} \left| P_{ij} - S(u_{ij}, v_{ij}) \right|^2 \qquad (3)$$

where $P_{ij}$ are the control points and $S(u_{ij}, v_{ij})$ are their corresponding
points on the surface, while the coefficient $\alpha > 0$ provides the required trade-off
flexibility. Naturally, the question arises as to how to define the corresponding
surface points. We adopted a solution using Greville abscissae, because
they are obtained using knot averaging and therefore provide high regularity of
the matrix.
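Greville abscissae follow directly from the knot vector by averaging. The sketch below shows the standard formula for one parametric direction; for a surface, the point paired with control point $P_{ij}$ would be taken at the $i$-th abscissa in $u$ and the $j$-th in $v$. The function name is illustrative.

```python
def greville_abscissae(knots, degree):
    """Greville abscissae: the mean of `degree` consecutive knots.

    For a knot vector of length n + degree + 1 there are n control points,
    and the i-th abscissa averages knots[i + 1] .. knots[i + degree].
    """
    count = len(knots) - degree - 1
    return [sum(knots[i + 1:i + degree + 1]) / degree for i in range(count)]


cubic_bezier_knots = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
abscissae = greville_abscissae(cubic_bezier_knots, 3)
```

For the clamped cubic knot vector above, the abscissae are 0, 1/3, 2/3 and 1, so each control point is paired with an evenly spaced parameter value.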

After implementing the concept, extensive experiments were conducted
with a variety of shapes, and the inclusion of the $\alpha$-criterion proved highly beneficial,
particularly in the case of large deformations of the initial model.

The use of Greville points improves conditioning of the system but does
not guarantee uniqueness of the solution. Thus, in the knowledge that the matrix
$A^T A$ is possibly degenerate, a new criterion and a corresponding weight need
to be introduced in order to set up a well-posed problem. We labelled it the
“$\beta$-criterion” and introduced it with the aim of limiting the displacement of the control
points relative to their original positions. This constraint minimises the combined
movement of all control points $P_{ij}$.

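The minimisation problem still remains linear and the cost function to be
minimised now becomes:

$$f = \sum_{k=1}^{M} \left| Q_k - S(u_k, v_k) \right|^2 + \alpha \sum_{i=0}^{n} \sum_{j=0}^{m} \left| P_{ij} - S(u_{ij}, v_{ij}) \right|^2 + \beta \sum_{i=0}^{n} \sum_{j=0}^{m} \left| P_{ij} - P^0_{ij} \right|^2 \qquad (4)$$

where $P^0_{ij}$ is the original position of the control point $P_{ij}$, and $\beta > 0$ is a
weighting factor.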

A necessary condition for a minimum is that the gradient of $f$ equals zero,
which gives the following generalised normal equations:

$$\left[ A^T A + \alpha (B - I)^T (B - I) + \beta I \right] a = A^T b + \beta a^0 \qquad (5)$$

where $B$ is the matrix of B-spline coefficients corresponding to the Greville
points, $I$ is the identity matrix and $a^0$ is the vector of original control points.

The system of equations (5) is solved using the same methods as those used
for solving (2). The only additional task is to compute $(B - I)^T (B - I)$, the
computational cost of which is negligible since, in general, the number of control points
is far smaller than the number of measured points.
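The step from the cost function (4) to the normal equations (5) can be spelled out in matrix form. The sketch below assumes, as the definitions of $A$, $B$, $a$ and $a^0$ suggest, that the surface points at the data parameters are $Aa$ and the surface points at the Greville parameters are $Ba$.

```latex
\begin{align*}
f(a) &= \|Aa - b\|^2 + \alpha\,\|(B - I)\,a\|^2 + \beta\,\|a - a^0\|^2 \\
\tfrac{1}{2}\nabla f(a) &= A^T (Aa - b) + \alpha\,(B - I)^T (B - I)\,a + \beta\,(a - a^0) = 0 \\
\Rightarrow\quad
\bigl[A^T A + \alpha\,(B - I)^T (B - I) + \beta I\bigr]\,a &= A^T b + \beta\,a^0
\end{align*}
```

Since $\beta I$ is positive definite for $\beta > 0$ and the other two terms are positive semi-definite, the system matrix is non-singular even when $A^T A$ is rank deficient.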

It is worth noting that the two parts of the minimisation will have comparable
weights by choosing:

$$\alpha = \frac{\mathrm{Tr}(A^T A)}{\mathrm{Tr}\left( (B - I)^T (B - I) \right)} \qquad (6)$$

In the proposed regularisation method, the quality of the fitted surface may
be controlled by a trade-off between the weights $\alpha$ and $\beta$. It is therefore
instructive to analyse the fitting results for different values of these parameters.
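A dense NumPy sketch of the regularised solve is given below for illustration. It is not the authors' banded implementation. It assumes that the $\alpha$ supplied by the user is a relative weight multiplied by the trace ratio of Equation (6), and that $\beta$ is used as an absolute weight; both conventions, and all names, are assumptions.

```python
import numpy as np


def fit_control_points(A, b, B, a0, alpha_rel=0.1, beta=1e-9):
    """Solve [A^T A + alpha (B-I)^T (B-I) + beta I] a = A^T b + beta a0.

    A  : (M, N) basis values at the data parameters
    b  : (M, 3) measured points
    B  : (N, N) basis values at the Greville points
    a0 : (N, 3) original control points
    alpha_rel is scaled by Tr(A^T A) / Tr((B-I)^T (B-I)) (assumed convention).
    """
    n = A.shape[1]
    ata = A.T @ A
    d = B - np.eye(n)
    dtd = d.T @ d
    scale = np.trace(dtd)
    alpha = alpha_rel * np.trace(ata) / scale if scale > 0 else 0.0
    lhs = ata + alpha * dtd + beta * np.eye(n)
    rhs = A.T @ b + beta * a0
    return np.linalg.solve(lhs, rhs)


# Toy case: the third control point is not covered by any measurement.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
B = np.array([[1.0, 0.0, 0.0], [0.25, 0.5, 0.25], [0.0, 0.0, 1.0]])
a0 = np.zeros((3, 3))
fitted = fit_control_points(A, b, B, a0, alpha_rel=0.0, beta=1.0)
```

In the toy case the unmeasured control point stays at its original position, which is the stabilising effect attributed to the $\beta$ term.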




Fig. 2. (a) Measured data and the base surface; (b) Effect of varying $\alpha$ and $\beta$ on the
fitting results



In Fig. 2 the updating of the model is presented. Note that the data is available
only in a small part of the object (cone-like deformation in the centre).
The parameter values in Fig. 2b represent relative weights according to
Equation (6). From these results the following conclusions may be drawn: Sufficiently
small $\alpha$ and $\beta$ have negligible effect in the regions with enough data but stabilise
the surface in regions with no data. In particular, a small positive $\beta$ will
remove instability from the system. Increasing $\alpha$ or $\beta$ will ultimately
begin to affect even the regions covered by the data, which is undesirable.

Apart from removing instability, the $\beta$-criterion also acts to preserve the
original shape. It is worth noting that this effect is unwanted in the measured
regions and can be effectively eliminated by iterative fitting.

The $\alpha$-criterion has the effect of flattening the regions unpopulated, or
sparsely populated, by the data points.

In all our experiments we used $\alpha = 0.1$ and $\beta = 10^{-9}$. Users are,
however, encouraged to make their own selection of weights to control the shape
of the unmeasured regions. This involves setting a balance between maintaining
the original shape of the unmeasured regions on the one hand, and reducing the
energy contained in them, on the other.

2.3 Surface Trimming

In Fig. 3b it is shown that the generated surface contains superfluous areas that
do not correspond to the measured cloud. These are due to the rectangular shape
of the initial surface. Hence, these areas must be trimmed and an appropriate
automatic trimming routine has been developed. The procedure is as follows:

— For each measured point, find the closest point on the fitted surface

— Find the convex hull of these points using Graham’s algorithm

— Fit a trimming curve through the convex hull in the parametric space
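The convex hull step can be sketched as below. Note that the code uses Andrew's monotone chain, a common variant of Graham's scan that sorts points by coordinate instead of by polar angle; it is swapped in here for brevity and is not claimed to be the routine used by the authors. Input points are assumed to be (u, v) parameter pairs.

```python
def convex_hull(points):
    """Convex hull of 2D points in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a left (counter-clockwise) turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # the last point starts the other half

    return half_hull(pts) + half_hull(reversed(pts))


uv_points = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5), (0.5, 0)]
hull = convex_hull(uv_points)
```

Because the hull is convex by construction, a trimming curve fitted through it cannot follow concave outlines of the measured region; that limitation is inherent to the procedure as described.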



2.4 Computational Efficiency 

Fitting complex surfaces through large measurement data sets can be expensive 
in terms of both computational time and memory, a significant proportion 
of which can be attributed to the matrix multiplication A^T A. The time and 
memory requirements can be drastically reduced by exploiting the sparse and 
banded nature of A: only the non-zero elements of A are stored and multiplied, 
which reduces both the number of computations and the memory requirements. 

The key features of our technique can be summarised as follows: 

— A^T A is computed without storing A or A^T, so that the memory space needed 
is linear in the number of control points. 

— Time to compute A^T A is linear in the number of measured points and 
does not depend on the number of control points. 

— Time to solve the system in A^T A is linear in the number of control points 
and does not depend on the number of measured points. 
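
The first two properties can be illustrated with a small sketch. Because 
B-spline basis functions have local support, each measured point yields a row 
of A with only a few non-zero entries, so its contribution to A^T A can be 
added immediately and the row discarded. The Python code below shows this 
idea only; the row format (a list of (column, value) pairs), the function name 
and the dictionary storage are assumptions for illustration and stand in for 
the banded storage used by the authors, whose implementation is not given here. 

```python
from collections import defaultdict


def accumulate_normal_matrix(rows):
    """Accumulate N = A^T A from sparse rows of A, one row at a time.

    Each row is a list of (column_index, value) pairs holding only the
    non-zero basis-function values of one measured point. The matrix A
    itself is never stored.
    """
    normal = defaultdict(float)  # (i, j) -> entry of A^T A
    for row in rows:
        for i, a_i in row:
            for j, a_j in row:
                normal[(i, j)] += a_i * a_j
    return dict(normal)


# Two hypothetical measured points, each touching two control points.
rows = [[(0, 0.5), (1, 0.5)], [(1, 0.25), (2, 0.75)]]
N = accumulate_normal_matrix(rows)
```

The work per measured point depends only on the number of non-zeros in its 
row, which is why the accumulation time grows linearly with the number of 
points, and only entries inside the band are ever created. 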

A full discussion of the algorithm implementation is beyond the scope of this 
paper, and it will be a subject of a separate publication. Nevertheless, the results 
in Table 1 are included in order to provide an indication of the performance in 
terms of computational speed and memory requirements (Pentium III, 400MHz). 



Table 1. Computational performance with regard to time and memory 

Number of      Number of control points 
data points    10              100             1,000           10,000 

100            0.03s 42.6KB    0.09s 146KB     -               - 
1,000          0.08s 42.6KB    0.14s 146KB     0.78s 1054KB    - 
10,000         0.62s 42.6KB    0.69s 146KB     1.39s 1054KB    7.2s 9126KB 
100,000        6.17s 42.6KB    6.20s 146KB     6.97s 1054KB    13s 9126KB 
1,000,000      61.8s 42.6KB    61.5s 146KB     63.0s 1054KB    69s 9126KB 

3 Results 

The reconstruction of the Igea artefact (Fig. 3) demonstrates the quality of the 
results obtained using the rectangular base surface. Courtesy of University of 
Thessalonica, measured data is available from www.cyberware.com. The measured 
cloud and the automatically generated NURBS rectangle are visualised in 
Fig. 3a. The fitted and trimmed surfaces are presented in Figs. 3b-c. 





Fig. 3. Igea artefact fitting stages using planar base surface 



The reconstruction of the scanned knee, Fig. 0 demonstrates the quality of the 
results obtained using the cylindrical base. Data and base cylinder are presented 
in Fig.Et. The smoothing effects of the fitting can be seen when comparing the 
NURBS fitted surface of Fig.0D with the triangulation (courtesy of Cyberware) 

in Fig.Efc. 

In order to demonstrate the quality of the results obtained using a CAD model, 
we present a car windscreen involving a trimmed NURBS surface as shown in 
Fig.0 The grey area of Fig.lSti represents the digitised region corresponding to 
the untrimmed surface, whilst the isoparametric curves show the base surface. 
The modified shape after fitting is shown in Fig. 03- 

4 Conclusions 

The paper has presented a new method for modelling the shape of an object from 
a cloud of digitised points using NURBS. No special assumptions are made about 
the data distribution. Least squares fitting, which is the basis of the method, 
frequently suffers from the problems of rank deficient and ill-conditioned ma- 
trices. This was overcome by the regularisation of the least squares problem 
through adoption of additional criteria in the minimising functional, producing 
good results in both the measured and the unmeasured surface regions. The 
method requires an initial approximation of the object shape and this may be 
generated automatically, from the point cloud. Alternatively, in industrial ap- 
plications, the initial shape may be provided by a pre-defined CAD model. The 
implemented algorithms utilise the sparse structure of the matrices and achieve 
a considerable improvement over previously reported implementations in terms 
of computational speed and memory requirements. The method was successfully 
applied to a wide range of shapes, art objects as well as engineering parts, lead- 
ing us to conclude that the proposed approach is feasible and attractive for a 
wide range of potential applications. 



References 

1. Björck, A Numerical methods for least squares problems, Society for Industrial 
and Applied Mathematics, Philadelphia (1996). 

2. Dierckx, P Curve and surface fitting with splines, Oxford, Clarendon (1993). 

3. Farin, G E Curves and surfaces for computer-aided geometric design: a practical 
guide, 4th ed., Academic Press, San Diego; London (1997). 

4. Gordon, W and Riesenfeld, R ’B-spline curves and surfaces’ Computer Aided 
Geometric Design ed. Barnhill and Riesenfeld, Academic Press, New York (1974) 
95-125. 

5. Nazarian, B, Chedor, C and Sequeira, J ’Automatic reconstruction of irregular 
tubular structures using generalised cylinders’ MICAD 96 - Revue Internationale 
de CFAO et d’Infographie 11:11-20 

6. Ma, W and He, P R ’B-spline surface local updating with unorganised points’ 
Computer Aided Design 30 (11) (1998) 853-862. 

7. Ma, W and Kruth, J P ’Parameterization of randomly measured points for least 
squares fitting of B-spline curves and surfaces’ Computer-Aided Design 27 (9) 
(1995) 663-675. 

8. O’Rourke, J Computational geometry in C, Cambridge University Press (1998). 

9. Piegl, L A and Tiller, W The NURBS book, 2nd edn. Springer (1997). 

10. Press, W H, Teukolsky, S A, Vetterling W T and Flannery, B P Numerical Recipes 
in C : The Art of Scientific Computing, 2nd edn, Cambridge University Press 
(1993). 

11. Ramamoorthi, R. and Arvo, J ’Creating Generative models from Range Images’ 
Computer Graphics proceedings (1999) 195-204. 

12. Ristic, M, Brujic, D and Ainsworth, I. ’Precision Reconstruction of Manufactured 
Free-Form Components’ 12th Annual International Symposium SPIE Electronic 
Imaging 2000, San Jose, California (2000). 

13. Snyder, J and Kajiya, J ’Generative Modelling: A symbolic system for geometric 
modelling’ Computer Graphics (SIGGRAPH 92 proceedings) (1992) 369-378. 

14. Terzopoulos, D and Qin, H ’Dynamic NURBS with Geometric Constraints for 
interactive Sculpting’ ACM Transactions on Graphics, 13 (2) (1994) 103-136. 




Non-manifold Multi-tessellation: From Meshes 
to Iconic Representations of Objects 



Leila De Floriani, Paola Magillo, Franco Morando, and Enrico Puppo 



Department of Computer and Information Sciences (DISI), 
University of Genova, Via Dodecaneso 35, 16146 Genova - Italy 
{deflo,magillo,morando,puppo}@disi.unige.it 
http://www.disi.unige.it/research/Geometric_modeling/ 



Abstract. This paper describes preliminary research work aimed at ob- 
taining a multi-level iconic representation of 3D objects from geometric 
meshes. A single-level iconic model describes an object through parts 
of different dimensions connected to form a hypergraph. The multi-level 
iconic model, called Non-manifold Multi-Tessellation, incorporates de- 
compositions of an object into parts at different levels of abstraction, 
and permits selective refinement of an iconic representation. 



1 Introduction 

Polygonal meshes, in particular triangle meshes, are widely used representations 
of three-dimensional shapes in computer graphics, virtual reality, and simulation. 
As devices and systems for 3D object reconstruction become more and more 
common and reliable, meshes increase their relevance in applications. For 
instance, meshes are a suitable input to model databases nmn! within systems 
for generic 3D shape recognition and classification. 

Triangle meshes can approximate arbitrarily well the shape of an object, but 
they do not provide information on either its structure, or its morphological 
features. On the contrary, iconic models, intended as concise, part-based repre- 
sentations of an object, provide more structured descriptions, even if sometimes 
less accurate, thus giving a valid support for many application tasks. 

This paper describes key ideas and some preliminary results of our ongo- 
ing work on multiresolution iconic representations of 3D objects. We consider 
an object initially described as a triangle mesh, and we devise its progressive 
decomposition into parts, of different dimensions, that leads to an iconic rep- 
resentation of the object. This approach reflects the intuitive idea that some 
parts of an object can be perceived as lower-dimensional, depending on the level 
of abstraction. Each part of an object is represented as a geometric complex of 
the proper dimension and is characterized by some geometrical and topological 
shape features. The different parts arising from the decomposition are connected 
at non-manifold junctions. 

An iconic model can be provided at different levels of abstraction, depending 
on the number and dimension of its parts, and on their connection structure. We 
propose next a model that can encode a whole range of levels of detail, allowing 
us to refine the representation locally, in those parts of the object that are more 
interesting for a given application. 

In this paper, we introduce the single- and multi-level iconic models, and we 
address their design, and related data structures. More specific issues concerning 
their construction, and applications of such a model are only briefly outlined, as 
they are and will be the subject of our current and future research. 



1.1 Related Work 

There exists a large body of literature on the segmentation and representation 
of objects based on part decomposition (see, e.g., P for a fairly complete set of 
references. Typical approaches are based on a set of elementary shapes (general- 
ized cylinders, geons, superquadrics) mm- Siddiqi and Kimia ini developed 
a framework for partitioning schemes for decomposing 2D shapes and presented 
a hierarchical scheme which combines a boundary-based and a part-based ap- 
proach. There exist, however, very few proposals working on mesh-based repre- 
sentations. An example is the recent work by Cutzu j^, who uses a triangular 
mesh as object representation in finding a part decomposition that takes into 
account the perceptual similarities among several of the object views. 

A method for subdividing non-manifold geometric complexes into manifold 
parts has been proposed in jH] for an application to triangle mesh compression for 
transmission. Their approach is somewhat similar to our method for identifying 
manifold components in a non-manifold complex. 

Mesh simplification is popular in computer graphics to produce descriptions 
of objects at different levels of detail and several proposals exist in recent liter- 
ature (see, e.g., ^ for a survey). Many methods are based on edge collapse, i.e., 
on a local operator which contracts an edge to a vertex. 

Multiresolution models based on meshes have also been extensively studied 
in computer graphics in order to provide compact ways of describing several lev- 
els of details in a single data structure 0. However, existing models are mainly 
oriented to visualization. A general multiresolution model for representing 3D 
shapes described by triangle meshes was proposed in 0. Such model was, how- 
ever, restricted to describe manifold shapes. The model we propose in this paper 
is inspired by that work. 



1.2 Preliminaries 

In the following, we briefly and informally review some standard concepts of 
algebraic topology. See, e.g., [2] for a more formal treatment. 

Simplices of dimension 0, 1, and 2 are points (vertices), straight-line segments 
(edges), and triangles, respectively. A simplicial complex Σ is a set of simplices 
such that no two simplices intersect, except when either a simplex is a facet of 
another simplex of higher dimension, or two simplices of the same dimension 
share some facet. The carrier of a simplicial complex is the subset of space 
covered by the union of all its simplices. A complex having a uniformly 
dimensional carrier is called a mesh. 

The star σ* of a simplex σ ∈ Σ is the set of all simplices for which σ is a 
facet. A vertex v is said to be a manifold vertex if and only if either v* 
contains no triangle and at most two edges, or it consists of a single fan of 
triangles (i.e., an either open or cyclic sequence of adjacent triangles all 
sharing one vertex). An edge e is said to be a manifold edge if and only if its 
star e* contains at most two triangles. A simplicial complex Σ is said to be a 
manifold complex if and only if all its vertices and edges are manifold. 
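
The two tests above can be written down directly. The following Python sketch 
assumes a representation that is not prescribed by the text: triangles are 
3-tuples of vertex ids, and "wire" edges (edges that are not facets of any 
triangle) are 2-tuples. A star made only of triangles is a single fan exactly 
when the edges opposite v form one connected path or cycle, which is what the 
code checks. 

```python
from collections import defaultdict
from itertools import combinations


def non_manifold_edges(triangles):
    """Edges whose star contains three or more triangles."""
    count = defaultdict(int)
    for t in triangles:
        for e in combinations(sorted(t), 2):
            count[e] += 1
    return {e for e, n in count.items() if n > 2}


def is_manifold_vertex(v, triangles, wire_edges):
    """Apply the manifold-vertex test to vertex v."""
    tris = [t for t in triangles if v in t]
    wires = [e for e in wire_edges if v in e]
    if not tris:
        return len(wires) <= 2          # no triangle, at most two edges
    if wires:
        return False                    # a fan plus a dangling edge
    # Each incident triangle contributes the edge opposite v to the link.
    link = defaultdict(set)
    for t in tris:
        a, b = (x for x in t if x != v)
        link[a].add(b)
        link[b].add(a)
    if any(len(nbrs) > 2 for nbrs in link.values()):
        return False
    # A connected link with all degrees <= 2 is a path or a cycle: one fan.
    start = next(iter(link))
    seen, stack = {start}, [start]
    while stack:
        for n in link[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen) == len(link)


# Two triangles touching only at vertex 0 ("bow-tie"): 0 is non-manifold.
bow_tie = [(0, 1, 2), (0, 3, 4)]
# Three triangles sharing edge (0, 1): that edge is non-manifold.
book = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
```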

The boundary of a k-dimensional manifold mesh is the union of all its (k − 1)- 
simplices that contain fewer than two k-simplices in their star. If the boundary 
is empty, then the mesh is said to be without boundary. The carrier of a 
two-dimensional manifold mesh without/with boundary is a closed/open surface; 
the carrier of a one-dimensional manifold mesh without/with boundary is a 
closed/open line. 

A two-manifold mesh is characterized, from a topological point of view, by 
the number of boundaries (more precisely, the number of connected components 
of its boundary), and the topological type of the surface obtained by closing each 
boundary with a disc. 



2 An Iconic Object Model 



Non-manifold, mixed-dimensional simplicial complexes are suitable to represent 
the shape of a 3D object as an aggregation of parts, where non-manifold edges 
and vertices represent junctions among different parts. In order to identify and 
explicitly represent the parts, we decompose a complex into maximal manifold 
components, each having a certain dimension, and we organize such components 
into a hypergraph representing their assembly structure. 

In the following, we give a constructive definition of the iconic model. We 
refer to Fig. 1 as a running example. Let Σ be a non-manifold, mixed-dimensional 
simplicial complex. First, we find all non-manifold edges of Σ, i.e., those edges 
having three or more triangles in their star. We replace each such edge e with 
as many copies as there are triangles in e* (in Fig. 1b, edges v1v2 and v2v3 
are replaced with three copies each). Then, we find all non-manifold vertices 
by checking the star of each vertex. For each such vertex v, we decompose v* 
into maximal subsets, such that each subset is either a single edge or a fan of 
triangles. We replace v with as many copies of v as there are maximal subsets 
in v* (see Fig. 1c). Note that the replication of vertices and edges is a purely 
topological operation since each copy maintains its position. 

The above process decomposes Σ into a set of parts {Γ1, ..., Γm}, where each 
part is either a one- or a two-manifold simplicial complex. We call these parts the 
manifold components of Σ. Each component Γi is characterized by its dimension, 
number of boundaries and, if two-dimensional, by its orientability and genus. 
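
One way to see which top simplices end up in the same manifold component is 
sketched below in Python. It is an illustration under one reading of the 
construction above, not the authors' procedure: instead of physically copying 
vertices and edges, it groups triangles across edges shared by exactly two 
triangles, and groups wire edges at vertices that carry no triangle and exactly 
two wire edges. The input format and all names are assumptions. 

```python
from collections import defaultdict
from itertools import combinations


def manifold_components(triangles, wire_edges):
    """Group top simplices (triangles and wire edges) into components."""
    simplices = [tuple(sorted(s)) for s in list(triangles) + list(wire_edges)]
    parent = {s: s for s in simplices}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path compression
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    edge_star = defaultdict(list)   # edge -> incident triangles
    wire_star = defaultdict(list)   # vertex -> incident wire edges
    tri_vertices = set()
    for s in simplices:
        if len(s) == 3:
            tri_vertices.update(s)
            for e in combinations(s, 2):
                edge_star[e].append(s)
        else:
            for v in s:
                wire_star[v].append(s)

    for tris in edge_star.values():
        if len(tris) == 2:          # manifold interior edge: same component
            union(tris[0], tris[1])
    for v, wires in wire_star.items():
        if v not in tri_vertices and len(wires) == 2:
            union(wires[0], wires[1])

    groups = defaultdict(set)
    for s in simplices:
        groups[find(s)].add(s)
    return list(groups.values())


# A square made of two triangles with a two-edge "stick" attached at vertex 3.
parts = manifold_components([(0, 1, 2), (1, 2, 3)], [(3, 4), (4, 5)])
```

In this example the square forms one two-dimensional component and the stick 
one one-dimensional component; vertex 3, where they meet, is non-manifold and 
would become a junction. 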
Fig. 1. (a) A complex with three non-manifold vertices v1, v2, v3 and two 
non-manifold edges v1v2 and v2v3. (b) Splitting non-manifold edges. (c) Splitting 
non-manifold vertices. (d) The iconic hypergraph. 



The iconic model we propose consists of a hypergraph that we call the Iconic 
Hypergraph (IH). Its nodes are the manifold components of Σ, and its hyperarcs 
correspond to (groups of) simplices that connect different components. 

The hyperarcs of the IH are obtained as follows. We initially set one hyperarc 
for each non-manifold vertex v and for each non-manifold edge e of Σ, where the 
hyperarc connects all components that contain a copy of v and e, respectively. 
Two hyperarcs connecting the same manifold components are said to be similar. 
Each maximal set of similar hyperarcs, such that the union of their corresponding 
simplices is connected, is then replaced with a single hyperarc. Now, the set of 
simplices associated with each hyperarc is either a 0-dimensional complex (i.e., a 
single vertex), or a 1-dimensional complex which is not necessarily manifold. In 
the latter case, we decompose such 1-dimensional complex into manifold parts (in 
the same way as we did for Σ), and we replace its corresponding hyperarc with 
as many hyperarcs as there are manifold parts. We call junction the simplicial 
complex associated with a hyperarc since it represents a point or line where two 
or more components touch each other. 

The iconic hypergraph for the mesh of Figure 1a is depicted in Figure 1d. 
For the sake of simplicity, we will draw manifold components through graphical 
symbols, namely blobs, sticks, and dots, depending on dimension. Hyperarcs will 
correspond to points where different symbols meet (see Figure 3). 

The input complex Σ is encoded through a standard data structure for 
representing non-manifold, mixed-dimensional complexes m, which provides 
information sufficient to run the construction procedure efficiently. Each 
component Γi (node of the IH) is encoded in a standard data structure for 
manifold triangular meshes, in which every triangle is described by its three 
vertices and is linked to its three adjacent triangles. For each component Γi 
we also store synthetic 
information (dimension, number of boundaries, orientability, genus), and the list 
of its incident hyperarcs. A hyperarc stores pointers to the nodes it connects, 
a representation of the associated junction (0- or 1-dimensional complex), and 
synthetic information for it. Further details on the definition and construction 
of the IH, as well as on data structures that we use to encode it are given in m- 
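
The information listed above can be pictured with a small record layout. The 
Python dataclasses below are only a sketch of what a node and a hyperarc might 
hold; the class names, field names and the `connect` helper are hypothetical, 
and the mesh of each component is reduced to a plain list of triangles rather 
than the adjacency-linked structure described in the text. 

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Hyperarc:
    # Junction: a 0-dimensional (single vertex) or 1-dimensional complex,
    # stored here as a list of simplices given by their vertex ids.
    junction: List[Tuple[int, ...]]
    nodes: List["ComponentNode"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class ComponentNode:
    dimension: int                      # 1 or 2
    num_boundaries: int
    orientable: Optional[bool] = None   # meaningful for dimension 2 only
    genus: Optional[int] = None         # meaningful for dimension 2 only
    simplices: List[Tuple[int, ...]] = field(default_factory=list)
    hyperarcs: List[Hyperarc] = field(default_factory=list, repr=False)


def connect(arc, *nodes):
    """Attach a hyperarc to every component that meets at its junction."""
    for node in nodes:
        arc.nodes.append(node)
        node.hyperarcs.append(arc)


# A disc-like blob and a stick meeting at vertex 3.
blob = ComponentNode(dimension=2, num_boundaries=1, orientable=True, genus=0,
                     simplices=[(0, 1, 2), (1, 2, 3)])
stick = ComponentNode(dimension=1, num_boundaries=1,
                      simplices=[(3, 4), (4, 5)])
joint = Hyperarc(junction=[(3,)])
connect(joint, blob, stick)
```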

3 Non-manifold Simplification 

Initially, a two-manifold triangle mesh Σ0 describing an object is given. The 
iconic model associated with such a mesh is trivial and consists of just one node. 
We introduce non-manifold simplification as a way to modify Σ0 by contracting 
the dimension of some of its parts, thus decomposing the object into parts. 
Non-manifold simplification is an iterative and progressive process, which 
incrementally produces local changes on a current mesh (initialized with Σ0). The 
iconic hypergraph associated with the current mesh is affected as a consequence. 




Fig. 2. Transformations of a simplicial complex through edge collapses (white arrows) 
and vertex splits (dark arrows). 

Non-manifold simplification is based on the iterative application of a local 
modification operator, called edge collapse: an edge e = v1v2 shrinks to one point 
v along its length; each triangle in the star of e collapses to an edge, while the 
other simplices in v1* and v2* are deformed into simplices incident in v. Figure 2a 
shows an example of an edge collapse in a manifold mesh. 
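
The combinatorial effect of an edge collapse can be sketched in a few lines of 
Python. The sketch assumes that a complex is stored as a set of top simplices, 
each a frozenset of vertex ids; this representation and the function name are 
illustrative choices, and vertex positions are not modelled. 

```python
def collapse_edge(top_simplices, v1, v2, v):
    """Collapse edge v1v2 into vertex v and return the new top simplices."""
    # Rename v1 and v2 to v; simplices containing both lose one dimension.
    mapped = {frozenset(v if x in (v1, v2) else x for x in s)
              for s in top_simplices}
    # Keep only simplices that are not proper faces of another simplex.
    return {s for s in mapped if not any(s < t for t in mapped)}


# Two triangles sharing edge 0-1: collapsing it leaves two edges, i.e. this
# part of the mesh becomes one-dimensional.
mesh = {frozenset({0, 1, 2}), frozenset({0, 1, 3})}
after = collapse_edge(mesh, 0, 1, 0)
```

When a triangle degenerates to an edge that already bounds another triangle, 
the second step discards it, so the result again lists top simplices only. 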

An edge collapse can produce non-manifold configurations and thus it modifies 
the iconic model corresponding to the current mesh. In Figure 2b, two 
successive collapses create first a non-manifold edge and then a non-manifold 
vertex; in Figure 2c, we have a similar situation in which, in addition, a part 
of the mesh becomes one-dimensional. Collapses that lead to new non-manifold 
configurations can be detected by performing a local analysis of the portion of 
complex modified by the operation jTj. 




Fig. 3. A sequence of edge collapses simplifying an initial mesh into a single point. For 
simplicity, a 2D shape is considered. Arrows denote edge collapses. The small icon made 
of blobs and sticks depicts the iconic model associated with each simplicial complex in 
the sequence. 



A sequence of edge collapses may lead to a representation in which the dif- 
ferent parts of an object are described by subcomplexes of different dimension. 
Figure 3 shows a two-dimensional example of a sequence of edge collapses applied 
to a mesh representing a lamp. For instance, collapse K identifies the cap 
and the base as distinct parts; collapse J further splits the base by revealing the 
one-dimensional nature of the stem; etc.; eventually, the whole shape is reduced 
to a single point. 

On the other hand, not all sequences are suitable to produce an iconic model. 
Figure 4 shows an example of a “bad” sequence. Our aim is to find sequences 
that can identify meaningful parts during the simplification process. This can be 
achieved by finding a suitable cost function that assigns to each edge e of the 
current complex a non-negative value, i.e., the estimated cost of collapsing e. At 
each step, the edge of minimum cost is collapsed. 
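
The greedy scheme just described can be sketched as follows. The Python code 
is a naive illustration: it rescans all edges at every step instead of using a 
priority queue, places the merged vertex at the edge midpoint, and uses edge 
length as a placeholder cost. None of these choices comes from the paper, which 
leaves the cost function open; all names are hypothetical. 

```python
import math
from itertools import combinations


def edges_of(top_simplices):
    """All edges (as frozensets) of a complex given by its top simplices."""
    return {frozenset(e) for s in top_simplices if len(s) >= 2
            for e in combinations(sorted(s), 2)}


def edge_length(edge, pos):
    """Placeholder cost: Euclidean length of the edge."""
    a, b = (pos[v] for v in edge)
    return math.dist(a, b)


def greedy_simplify(top_simplices, positions, cost=edge_length):
    """Collapse the cheapest edge repeatedly until no edge is left."""
    complex_ = {frozenset(s) for s in top_simplices}
    pos = dict(positions)
    sequence = []
    while True:
        edges = edges_of(complex_)
        if not edges:
            break
        # Cheapest edge; vertex ids break ties deterministically.
        v1, v2 = sorted(min(edges, key=lambda e: (cost(e, pos), sorted(e))))
        # The merged vertex keeps the id v1 and moves to the midpoint.
        pos[v1] = tuple((a + b) / 2 for a, b in zip(pos[v1], pos[v2]))
        mapped = {frozenset(v1 if x == v2 else x for x in s) for s in complex_}
        complex_ = {s for s in mapped if not any(s < t for t in mapped)}
        sequence.append((v1, v2))
    return sequence, complex_


triangle = [(0, 1, 2)]
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 3.0)}
sequence, final = greedy_simplify(triangle, coords)
```

Swapping in a different `cost` function changes the order of the collapses 
without touching the loop, which mirrors the remark that the scheme is 
parametric over the cost function. 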

The cost function must consider a weighted combination of parameters mea- 
suring the changes in shape geometry and topology caused by an edge collapse. 
Cost functions proposed in the literature are aimed essentially at rendering and 
take into account only geometry. Examples are: length of the edge to be col- 
lapsed, variation in surface area, variation in differential properties, smoothness, 
preservation or destruction of sharp features. Simplification of mixed dimensional 
Fig. 4. A simplification sequence that is not able to reveal the part-based structure: 
the base of the lamp is merged into the cap, instead of being identified as a separate 
part. 



complexes must also take into account changes in topology and in the structure 
of the iconic model. Examples are: creation or elimination of non-manifold 
entities, generation of lower-dimensional parts, etc. Such issues are the subject 
of our ongoing research. We remark, however, that the whole simplification 
scheme is completely parametric over the cost function adopted. 

4 Non-manifold Multi-tessellation 

In this section, we introduce a multi-level iconic model, called Non-manifold 
Multi-Tessellation (NMT). This model is obtained by building on top of the 
sequence of edge collapses [c1, c2, ..., cm] described in the previous section. The 
basic idea of organizing simplification steps in a partial order is inherited from 
our previous multi-resolution model for the manifold case p]. 

We define the top simplices of a complex Σ as those simplices of Σ that have an 
empty star (i.e., they are not facets of other simplices). In the following, Σ is 
characterized just as the collection of its top simplices, all other simplices being 
implicitly represented as facets. We see the collapse of an edge e = v1v2 into 
a vertex v as an operation that removes the top simplices of v1* and v2* and 
replaces them with the top simplices of v*. 

We define a partial order over the set of edge collapses {c1, c2, ..., cm} as the 
transitive closure of the following dependency relation: a collapse ci depends on 
another collapse cj if and only if ci creates some top simplices removed by cj. The 
partially ordered set of edge collapses is represented graphically as a Directed 
Acyclic Graph (DAG) where the nodes are the collapses, and the arcs denote 
dependency links. Figure 5a shows the DAG corresponding to the simplification 
sequence of Figure 3. 
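
The dependency relation can be computed directly from the sets of top 
simplices that each collapse removes and creates. The Python sketch below 
assumes each collapse is recorded as a dictionary with `removed` and `created` 
sets; this record format and the function name are illustrative assumptions. 
Each returned arc (j, i) states that collapse i depends on collapse j. 

```python
def dependency_arcs(collapses):
    """Arcs (j, i) such that collapse i depends on collapse j.

    Following the definition in the text, c_i depends on c_j when c_i
    creates some top simplex that c_j removes.
    """
    arcs = set()
    for i, ci in enumerate(collapses):
        for j, cj in enumerate(collapses):
            if i != j and ci["created"] & cj["removed"]:
                arcs.add((j, i))
    return arcs


# A triangle reduced to a point in two steps (indices 0 and 1):
# the first collapse turns the triangle into an edge, the second turns
# that edge into a single vertex.
collapses = [
    {"removed": {frozenset({0, 1, 2})}, "created": {frozenset({0, 2})}},
    {"removed": {frozenset({0, 2})}, "created": {frozenset({0})}},
]
arcs = dependency_arcs(collapses)
```

Here the first collapse depends on the second: when refining from the single 
point, the second collapse must be undone before the first one can be. 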

The above DAG encodes resolution changes at the granularity level of a single 
edge collapse, which is excessively fine-grained for our purpose. Indeed, we wish 
to consider larger resolution changes, corresponding to sets of edge collapses that 
produce meaningful modifications in the underlying iconic model. 

We assume that we have some criterion for selecting a subset R ⊆ {c1, ..., cm} 
of relevant edge collapses. We group the DAG nodes into clusters, each cluster 
containing one relevant node ci and all the non-relevant nodes lying on paths 
that start from ci and end either at another relevant node, or at the common 
descendant of ci and another relevant node. For each cluster, we form a 
macro-node, and call Non-manifold Multi-Tessellation (NMT) the DAG resulting 
from clustering. Each node of the NMT corresponds to several collapses which are 
seen as a single, atomic modification of the simplicial complex and its iconic 
model. 



Fig. 5. (a) The DAG representing the partial order among the vertex splits of 
Figure 3; nodes that change the iconic model are drawn round; dotted circles 
enclose nodes forming one cluster. (b) The resulting NMT after clustering; a 
consistent subset of nodes is highlighted. (c) The mesh and iconic model 
associated with such consistent subset. 


As an example, we may assume that any collapse modifying the topology of 
the complex is relevant (see Figure 5). Note, however, that this may still not be 
selective enough, since it might encode several intermediate steps that do not 
necessarily correspond to meaningful modifications. We are currently investigating 
several definitions of a relevant edge collapse, and related clustering techniques. 
Note, however, that the above definition of NMT is completely parametric over 
the definition of a relevant edge collapse. 

The NMT can be used for obtaining iconic models at various levels of ab- 
straction, possibly different over the various object parts. Each such iconic model 
corresponds to selecting a subset S of nodes of the NMT and simplifying Σ0 
through all the edge collapses that are not contained in such nodes. The selected 
set of macro-nodes must, however, be consistent with the partial order defined 
in the DAG. 

We say that a subset S of nodes of an NMT is consistent if, for every node 
u ∈ S, all the nodes w that are predecessors of u in the partial order are also in 
S. Any consistent set S has an associated simplicial complex Σ_S, and a related 
iconic hypergraph, which corresponds to an intermediate level of detail (see, e.g., 
Figure 5c). Σ_S is obtained by starting from the single point and undoing all edge 
collapses contained in set S. 
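
The consistency condition is easy to check mechanically. The Python sketch 
below assumes the DAG is given as a mapping from each node to the set of its 
direct predecessors; the names are hypothetical. Checking direct predecessors 
for every selected node is enough, since the condition then propagates to all 
predecessors in the partial order. 

```python
def is_consistent(selected, predecessors):
    """True if every selected node has all its direct predecessors selected."""
    return all(predecessors.get(u, set()) <= selected for u in selected)


def consistent_closure(nodes, predecessors):
    """Smallest consistent set containing the given nodes."""
    closed, stack = set(), list(nodes)
    while stack:
        u = stack.pop()
        if u not in closed:
            closed.add(u)
            stack.extend(predecessors.get(u, ()))
    return closed


# A chain a -> b -> c: b needs a, and c needs b.
preds = {"b": {"a"}, "c": {"b"}}
```

A query for a given level of detail can thus start from the nodes of interest 
and take their closure before any vertex split is applied. 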

Undoing an edge collapse means recovering the original edge e = v1v2 from 
the vertex v on which it was collapsed: such operation is called a vertex split. 
Note that consistency of set S with respect to the partial order implies that, 
when splitting v, the star of v is exactly the same as it was just after collapsing 
e during the simplification process. Vertex v is replaced by edge e; some of the 
edges in v* are expanded into triangles of e*; each remaining top-simplex σ ∈ v* 
either replaces v with v1, or replaces v with v2, or becomes two simplices of the 
same dimension as σ, one incident in v1 and the other one incident in v2. 

The NMT supports a dynamic algorithm to modify a mesh Σ_S and its iconic 
hypergraph representation while adding and removing macro-nodes to and from 
S. The algorithm performs a traversal of the DAG describing the NMT, starting 
at the given consistent set S, and involves two phases: a contraction phase, 
in which nodes are subtracted from set S (and Σ_S is coarsened through the 
corresponding edge collapses) in parts of the complex where resolution is too 
high; and an expansion phase, in which nodes are added to S (and Σ_S is refined 
through the corresponding vertex splits) in parts where the resolution is not 
sufficient (see H2| for details). 

In order to support the dynamic algorithm, we use the data structure described 
in Section 2 for encoding the iconic hypergraph associated with the 
current mesh Σ_S, a standard data structure for the DAG, and a data structure 
to represent the updates contained in each macro-node. To this aim, we store a 
reference to vertex v, two references to the endpoints v1 and v2 of e in the vertex 
table, an offset vector used to update vertex positions, and a bit-mask of 
split-codes that specifies how to transform each top-simplex σ ∈ v* when splitting v 
(see m for details). 

5 Conclusions and Future Work 

In this paper, we have presented an iconic model for three-dimensional objects 
which is based on the explicit representation of object parts through meshes, and 
of their assembly structure. We have introduced a multi-level model which spans 
a whole range of object iconic representations, and allows also for selectively ex- 
tracting different iconic representations for different object parts. A construction 
algorithm for the single-level model has been described, and a parametric con- 
struction technique for the multi-level model through progressive simplification 
and clustering has been outlined. Specific simplification and clustering strategies 
to reveal the part structure of an object starting at a detailed yet unstructured 
manifold mesh are the subject of our current investigations. 

Our aim is to build a model database supporting, for instance, queries by sim- 
ilarity and shape recognition with a hierarchical approach. The basic observation 
is that objects with a similar structure will have similar iconic representations at 
a low level of detail. Thus, we can design a database organized as an AND/OR 
DAG in which every object is represented by a subgraph, corresponding to its 
NMT description. Objects which have common representations, up to a certain 
level of detail, share the same data structure, up to that level. Data structures 
for encoding such a database, techniques for its construction and update, and 
techniques for queries by similarity and model-driven object recognition will be 
the subjects of our future research. 
Abstract. This paper describes three systems for image indexing and 
retrieval based on contour analysis. The systems compared are F-Index 
for Contours (FIC), Hierarchical Entropy-based Representation (HER) 
and Sketch-based Query by Dialogue (SQD). The first system has been 
modified for contour matching, since it was originally designed for a 
different purpose. These specific systems were chosen 
because of their similar conception, aim and computational complexity. 
An experimental and conceptual comparison has been carried out in 
order to assess retrieval precision, efficiency and usability. The results 
show that FIC and HER have similar performance in the high end of the 
spectrum, while SQD has less precise retrieval and less efficient search. 



1 Introduction 

The days when computers were text-only systems are long gone. Nowadays, even 
low-end personal computers are able to display, employ and process images — at 
least in the form of icons, but usually in much more complex ways. The typi- 
cal user has several image files in his storage devices; the high-end user or the 
graphic specialist may well have thousands. Furthermore, there are computer 
systems that are entirely devoted to the archival or treatment of images: many 
multimedia databases consist largely of images, and several applications 
in a wide range of specific fields rely on digital images for many of their 
functions. As the number of images available to a system increases, the need for 
an automatic image retrieval system of some kind becomes more pressing. 

For a human being, it is very easy to recognize shapes and textures indepen- 
dently from their position and orientation; however, it is much harder to specify 
exactly the steps involved in such recognition, and this makes it difficult to de- 
vise an algorithm that can be programmed into a computer. Indeed, the general 
problem of image classification and retrieval by content, that is, based only on 
the actual content of the pictorial scene without the aid of textual labels or other 
metadata, is a rather hard one. Although no general solution to this problem 
has been found yet, there are a number of results that solve, at least partially, 
particular problems in specific areas. Indeed, it is much easier to devise an image 
retrieval system if the designer has a priori information on the type of images 
involved in the queries. The resulting indexing system might not be suitable for 
general application, but this kind of ad hoc solution can be useful as long as it 
works in a specific environment. 

The techniques proposed in the scientific literature fall almost invariably in 
the category of feature extraction methods. The key idea is that of analyzing the 
pictorial scene in order to obtain n numerical features. By doing so, an image is 
mapped from image (or pixel) space into a single point in n-dimensional feature 
space, where traditional — and exact — spatial access methods may be used to 
retrieve points (i.e., images) that are close to the query image. This type of user 
interaction paradigm is called ‘query by example,’ because the user supplies an 
example (the query image) and the system looks for images that are close to it 
in feature space — and hopefully also from a perceptive point of view. Another 
possible paradigm for user interaction is called ‘query by sketch.’ In this case, 
the user draws a sketch of some object which should appear in the retrieved 
images. Depending on the nature of the feature extraction engine, either or both 
paradigms might be applicable. 

The underlying assumption with all feature extraction based methods is that 
proximity in feature space implies some kind of proximity in image space. Under 
certain hypotheses about the feature extraction process and the definition of 
distance in feature space, this is indeed the case. However, the type of features 
utilized and the organization of the information have a considerable influence 
over the usability of the system and the subjective perceived quality of the end 
result, which may vary widely from system to system. Also, the computational 
requirements can vary from lightweight through taxing to nearly infeasible. 

This paper describes three systems for image indexing and retrieval that extract
their features from an analysis of the contour. The pros and cons of each
system are pinpointed and compared. The methods we will discuss are: (a) the
F-Index, a Discrete Fourier Transform based index, originally devised for time
series and adapted to work with object contours specifically for the present
comparison (FIC); (b) the Hierarchical Entropy-based Representation for object
contours (HER); (c) Sketch-based Query by Dialog (SQD). The F-Index and HER
were originally designed with a query by example interface, while SQD had a
sketch-oriented interface with particular attention to user interaction. However,
all of these methods can be queried by both paradigms. As we shall see in
the next section, all of these systems should be at least robust to certain types
of image transformations such as scaling, rotations and reflections.

Other methods, which might be regarded as similar under several aspects, 
have been excluded from the comparison because their computational cost is 
significantly higher. Such is the case, for instance, of Elastic Matching.

The outline of the paper is as follows. Section 2 describes the methods one by
one, discussing the relative strengths and weaknesses from the point of view of
their design. The experiments that have been done in order to assess the methods
from the practical point of view of their actual performance are presented and
discussed in Section 3. Finally, Section 4 draws a few concluding remarks.



2 The Methods

The choice of the methods considered has been aimed at making the comparison 
as meaningful as possible. In fact, they all share several characteristics: 

1. The data that make up the index are derived from the contour of the most 
prominent object in the pictorial scene. This implies that all of these methods 
are intended to be applied to images of the ‘one object against a background’ 
type, so they work best with this kind of image. Another implication is
that segmentation is a necessary step in order to separate the object from
the background. The exact nature of the segmentation algorithm need not
concern us here, as long as it is the same for all the methods being compared.

2. The execution time is in the low range of the spectrum of available tech- 
niques. As a consequence, all these methods can be effectively implemented 
on low end workstations. 

3. The performance of these methods is very similar as long as the index size
is the same. However, two of these methods, namely HER and FIC, provide
a way to tune index size, while the third (SQD) produces indices of a fixed
size, independently of both image size and user choice of parameters.

4. The ability to offer feedback to the user as to the actual look of the query 
shape after feature extraction, so that it is possible to refine subsequent 
queries. This ability is integrated into the user interface of SQD, while for the 
other two methods the user must iterate the querying process until satisfied 
with the result. 

One aspect where they do differ is in the ability that the user has to adjust
the operation of the system. FIC and HER allow the user to trade off speed for
accuracy and index size by changing the cutoff frequency (FIC) or the number of
maxima that end up in the index (HER). In contrast, SQD has its feature space
fixed at 4 dimensions, but it is only used for a filtering phase before ‘real’ search
takes place in another space. On the other hand, SQD has several adjustable
thresholds that may be tweaked in order to influence the size of the answer
set. However, adjusting these thresholds often does not yield predictable results.
The designers of SQD point this out: in their implementation, the values are
hard-wired into the system based on experimental results.

Another difference between FIC and HER on one side and SQD on the other
is that while the former systems use a spatial access method for searching, SQD
resorts to simple sequential search.



2.1 F-Index for Contours 

The F-Index was first introduced by Agrawal, Faloutsos and Swami to perform
similarity search in time series databases. Among the methods we consider,
it might well be the simplest to understand and implement. It was also the
first to appear in the literature, although the modification that allows it to be
used for image contours (FIC) has been made purposely for the present paper.

In order to obtain a time series from our contour data, we scan the contour
clockwise starting from its top left pixel, recording the distance between each
pixel and the center of mass, as shown in Figure 1. This yields a periodic time
series with as many points as there are pixels in the object contour.
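
The contour scan described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `centroid_distance_series` is hypothetical, the contour is assumed to be already extracted as an ordered list of pixel coordinates, and the center of mass is approximated here by the mean of the contour pixels.

```python
import math

def centroid_distance_series(contour):
    """Turn an ordered contour into a 1-D series of centroid distances.

    contour: list of (x, y) pixel coordinates, ordered along the boundary
    (assumed clockwise, starting at the top left pixel).
    """
    n = len(contour)
    # Assumption: the center of mass is taken over the contour pixels only.
    cx = sum(x for x, _ in contour) / n
    cy = sum(y for _, y in contour) / n
    return [math.hypot(x - cx, y - cy) for x, y in contour]

# Hypothetical example: the 8 boundary pixels of a 3x3 square.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
series = centroid_distance_series(square)
```

The resulting series has one sample per contour pixel and is periodic, since the scan ends where it started.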




Fig. 1. Representing an N-pixel 2-D contour (A) as a periodic 1-D time series (B)



Once we have a time series x(·) with N points, the steps performed to obtain
its FIC representation are the following:

1. Obtain the coefficients of its DFT (Discrete Fourier Transform),

X(f) = \frac{1}{\sqrt{N}} \sum_{i=0}^{N-1} x(i) \, e^{-j 2 \pi f i / N}, \quad f = 0, \ldots, N-1, \qquad (1)

where j = \sqrt{-1} is the imaginary unit. This yields N complex numbers.

2. Discard all DFT coefficients but the first M. In other words, the parameter
M is the cutoff frequency.

3. Construct a multidimensional index in 2M-space using an appropriate spatial
access method; in this case, an R*-tree [2].
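
As a rough sketch of steps 1 and 2, the code below computes the DFT directly from its definition and keeps the first M coefficients as a point in 2M-dimensional space. The names `dft` and `fic_features` are hypothetical, the 1/sqrt(N) normalization is an assumption, and the R*-tree of step 3 is not shown.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT with 1/sqrt(N) normalization (an assumption)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * f * i / n) for i in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def fic_features(x, m):
    """Keep the first m DFT coefficients as 2m real numbers (re, im pairs)."""
    feats = []
    for c in dft(x)[:m]:
        feats.extend((c.real, c.imag))
    return feats

# Hypothetical 8-sample series, cutoff M = 3, giving a point in 6-D space.
feats = fic_features([1, 2, 3, 4, 5, 6, 7, 8], 3)
```

A production system would use an FFT instead of the quadratic-time sum; the direct form is used here only to keep the sketch self-contained.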

The F-Index and its derivation FIC use the discrete form of the Fourier Transform,
which has several well-known mathematical properties, most importantly
linearity. Therefore, by invoking Parseval’s theorem, it can be proven
that searching by the F-Index we can expect no false dismissals. In other words,
images that lie within the specified distance from the query image will never fail
to appear in the answer set. The reason is that since many Fourier coefficients
are discarded, the distance between two items in feature space is never greater
than the original distance in pixel space. This indeed makes sure that there are no false
dismissals, but on the other hand it might introduce some false alarms that must
be filtered out in a postprocessing step.
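
The lower-bounding property can be checked numerically. The sketch below assumes a DFT normalized by 1/sqrt(N), so that Parseval's theorem preserves Euclidean distances; the two sample series and the helper names are made up for illustration.

```python
import cmath
import math

def dft(x):
    """Direct DFT with 1/sqrt(N) normalization (assumed, so distances are preserved)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * f * i / n) for i in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def euclid(u, v):
    """Euclidean distance; works for real or complex sequences."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical centroid-distance series of equal length.
x = [3.0, 4.0, 5.0, 4.5, 3.0, 2.0, 2.5, 3.5]
y = [3.2, 4.1, 4.0, 4.4, 3.6, 2.2, 2.0, 3.0]

d_time = euclid(x, y)              # distance between the original series
X, Y = dft(x), dft(y)
d_full = euclid(X, Y)              # equals d_time by Parseval's theorem
# Distance using only the first m coefficients, for m = 1 .. N.
d_trunc = [euclid(X[:m], Y[:m]) for m in range(1, len(x) + 1)]
```

Each truncated distance is a sum over a subset of non-negative terms, so it can only underestimate the full distance: a range query in feature space may return extra candidates, but it cannot miss a true match.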

Although the DFT and the closely related DCT (Discrete Cosine Transform),
used by JPEG, are able to capture a good deal of information about images,
sharply straight lines can’t be effectively represented unless we are willing to use
enough coefficients. As shown by Zahn and Roskies, an adequate approximation
of a polygon requires 15-30 coefficients. When the object has highly irregular
or jagged contours, even 30 coefficients are not enough to characterize the shape
adequately for accurate reconstruction. The increasingly good approximation of
a shape by its DFT is shown graphically in Figure 2.

FIC allows a reconstruction of the query shape based on the data from its
DFT. However crude it might be, even a 3-coefficient Fourier approximation of the
shape is something that the user can refer to in order to evaluate the effectiveness
of the query. This undoubtedly enhances the usability of the system.





Fig. 2. Approximation of a shape by the first M coefficients of its DFT



As for the spatial access method used in the FIC technique, it employs Euclidean
distance as its metric in both image space and feature space. The data
structure of choice, the R*-tree, has been shown experimentally to perform well
for dimensions up to 20, which means 10 complex Fourier coefficients.

The user can adjust the behavior of FIC by varying the cutoff frequency M,
thus trading off search time and index size for accuracy. The designers of the
F-Index experimentally found that 3 is a good value for the cutoff, but they were
dealing with ‘real’ 1-D data with spectrum patterns not unlike pink or brown
noise, in which most of the relevant information involves medium- or long-term
trends. In this case, we are dealing with 1-D encodings of 2-D data, which often
makes it appropriate to increase the cutoff frequency slightly.

2.2 Hierarchical Entropy-Based Representation 

As with FIC, the Hierarchical Entropy-based Representation (HER) technique
was originally designed for the indexing and retrieval of time series.
However, it has been shown to work well whenever the data, no matter what
their origin, could be meaningfully made into a time series by some kind of
transformation. In the case of image contours, this is done by scanning the contour
in the same way as for FIC.

Supposing we have a time series x(·) with N points, let us define the energy
of the i-th sample as E(i) = |x(i)|^2. The total energy of x(·) is simply
E = \sum_i E(i), while the relative energy of x(i) is E_r(i) = E(i)^2 / (E - E(i)).

The HER representation vector y of the sequence x(·) is obtained as follows:

1. Find the signal maxima and put them in a queue Q in decreasing magnitude
order, along with their x-axis positions. If the number of signal maxima is m,
the queue Q now contains {(i_1, x(i_1)), ..., (i_m, x(i_m))}.

2. Compute the relative energy E_r(t) of the first (largest) maximum in Q, say
x(t).

3. Compute the standard deviation \sigma(t) relative to the current maximum x(t)
using

\sigma(t) = \ldots \qquad (2)

In other words, we are considering x(t) to be the midpoint of a Gaussian
distribution. Compute the entropy relative to x(t) as

S(t) = \sum_{k=-\sigma(t)}^{\sigma(t)} |x(t + k)|. \qquad (3)

4. Concatenate the values computed for the current maximum x(t) to the end of
the HER output vector y. Remove x(t) from Q.

5. Go back to Step 2 until we have removed a predefined number M of maxima
from the queue Q.
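
The queue handling of steps 1 and 5 can be sketched as follows. This is an illustrative fragment only: the helper names are hypothetical, the series is treated as periodic, the rule used for plateaus (strictly greater than the left neighbor, not smaller than the right one) is an assumption, and the per-maximum quantities of steps 2 to 4 are deliberately left out.

```python
def local_maxima(x):
    """Return (position, value) pairs for the local maxima of a periodic series."""
    n = len(x)
    return [
        (i, x[i])
        for i in range(n)
        if x[i] > x[(i - 1) % n] and x[i] >= x[(i + 1) % n]
    ]

def maxima_queue(x, m):
    """Queue the maxima in decreasing magnitude order and keep the first m."""
    queue = sorted(local_maxima(x), key=lambda pair: pair[1], reverse=True)
    return queue[:m]

def energy(x):
    """Per-sample energy E(i) = |x(i)|^2."""
    return [abs(v) ** 2 for v in x]

# Hypothetical series with three local maxima; keep the M = 2 largest.
x = [1, 3, 2, 5, 4, 2, 6, 1]
queue = maxima_queue(x, 2)
total_energy = sum(energy(x))
```

With a fixed M the index size is known in advance, which is the reason given in the text for preferring the ‘number of maxima’ stopping test in the comparison.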



An alternative form for Step 5 keeps on iterating until the fraction of the total
energy remaining in the signal x(·) falls below a given threshold. In many cases,
the alternate test offers more control on index accuracy at the expense of
unpredictable index size. In order to perform our comparison with preset index sizes,
we have preferred the simpler ‘number of maxima’ test.

Differently from Fourier-based methods, this representation was never meant
for reconstructing the signal. However, it is indeed possible to reconstruct the
contour if one feature is added to the HER representation: the angle made by the
current maximum and some reference line, say, the positive X axis. This feature
might be employed to enhance the system’s usability by providing the user with
feedback about the actual appearance of the query shape, at the cost of a 33%
increase in index size. In this case, Figure 3 shows how a HER reconstruction
changes when increasing the number M of maxima. In Figure 3 it is assumed
that all interpolation between maxima is done by straight line segments. In
principle it is possible to use curves fitted to the position i where the maximum x(i)
occurs, but in practice the final effect is usually not worth the extra effort.

The spatial access method used by HER when processing queries does not use
Euclidean distance in feature space. Rather, the distance between two
representation vectors y_1 and y_2 is defined as

D(y_1, y_2) = \sum_{k=0}^{\infty} |y_{1k} - y_{2k}|. \qquad (4)

The data structure employed for the spatial organization of feature vectors is
based on k-d-trees.



Fig. 3. Approximation of a shape by its M largest maxima using HER



2.3 Sketch-Based Query by Dialog 

The Sketch-based Query by Dialog method (SQD) stands out in this group
for two peculiarities. First, it was conceived from the start as an interactive,
‘query, browse, refine the query’ method. Second, because of how the contour is
sampled, the index has a fixed size independent of the original contour. Index
size can’t be changed by tweaking any parameter, either.

In order to obtain its SQD representation z (called ‘signature’ by its designers),
a contour is sampled starting from the top left pixel at fixed 1° intervals,
thus obtaining N = 360 samples at \theta_i = 2\pi i / N for i = 0, ..., N-1. The distance
from the contour’s center of mass is recorded for each sample. If the contour is
concave, there might be more than one point corresponding to a single angle;
in this case, the maximum distance is recorded. An example of this process is
illustrated in Fig. 4. The signature allows a reconstruction of sorts, but it distorts
the concavities, as shown in Fig. 5. Compare Fig. 4 with Fig. 6, where the distance
is recorded for each single pixel in the contour.

In order to achieve scaling and rotation invariance, SQD normalizes the dis- 
tances with respect to the maximum distance and shifts the whole sequence of 
samples to put the farthest contour pixel in the first position. 
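
A possible way to compute such a signature is sketched below. It is an approximation of the procedure described in the text rather than the SQD authors' code: contour points are binned by their angle around the mean of the contour points, empty bins are left at zero, and the function name `sqd_signature` is hypothetical.

```python
import math

def sqd_signature(contour, n=360):
    """Fixed-size signature: max centroid distance per angular bin,
    normalized by the largest distance and shifted so that it comes first."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    sig = [0.0] * n  # assumption: bins hit by no contour point stay at 0
    for x, y in contour:
        theta = math.atan2(y - cy, x - cx) % (2 * math.pi)
        k = int(theta / (2 * math.pi) * n) % n
        # For concave shapes several points share a bin: keep the farthest.
        sig[k] = max(sig[k], math.hypot(x - cx, y - cy))
    top = max(sig)                      # assumes a non-degenerate contour
    sig = [d / top for d in sig]        # scaling invariance
    start = sig.index(1.0)
    return sig[start:] + sig[:start]    # rotation invariance

# Hypothetical test shape: 720 points on an ellipse with semi-axes 10 and 4.
ellipse = [
    (10 * math.cos(2 * math.pi * k / 720), 4 * math.sin(2 * math.pi * k / 720))
    for k in range(720)
]
signature = sqd_signature(ellipse)
```

Whatever the number of contour pixels, the result always has N = 360 entries, which is why the SQD index size cannot be tuned.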



Fig. 4. Converting the ‘F’ contour into a time series by sampling at fixed angle
increments (SQD signature)



Fig. 5. SQD reconstruction




Fig. 6. Converting the ‘F’ contour into a time series pixel by pixel (FIC, HER)



The contour, and therefore the image, is represented by the whole signature z. 
In order to perform the search, 4 features are extracted from z: 

1. The sum of all N distances. This feature is larger for elongated objects and 
smaller for nearly circular shapes. 

2. The variance of the distances. This helps distinguish jagged, sharply varying 
contours from smooth ones. 

3. The ratio between minimum and maximum distance. A low value in this 
feature might point to a concave contour segment. 

4. The integral of the Fourier spectrum of the sequence of distances. Given the
N complex DFT coefficients X(f), f = 0, ..., N-1, the integral spectrum is
given by the sum of their magnitudes:

S = \sum_{f=0}^{N-1} |X(f)|. \qquad (5)

This feature, like variance, is greater for sharply varying contours than it is
for smooth ones, but it is more sensitive to local variations between pixels
in the same neighborhood, that is, high frequencies in the DFT.
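
The four features listed above could be computed from a signature as in the following sketch. The function name `sqd_features` is hypothetical, the population variance is used, and the DFT is left unnormalized; the original system may scale these quantities differently.

```python
import cmath
import math

def sqd_features(z):
    """Return (sum, variance, min/max ratio, integral spectrum) of a signature."""
    n = len(z)
    total = sum(z)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in z) / n
    ratio = min(z) / max(z)
    # Sum of the magnitudes of the (unnormalized) DFT coefficients.
    spectrum = sum(
        abs(sum(z[i] * cmath.exp(-2j * math.pi * f * i / n) for i in range(n)))
        for f in range(n)
    )
    return total, variance, ratio, spectrum

# Hypothetical signatures: a perfect circle and a regularly jagged contour.
circle = sqd_features([1.0] * 16)
jagged = sqd_features([1.0, 0.5] * 8)
```

For the circle the variance is zero and the min/max ratio is 1, while the jagged signature has a positive variance and a ratio of 0.5.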

The system was originally designed for query by sketch, but query by example 
is also possible and it has indeed been used for our tests. There is no real spatial 
access method and distance in feature space is a highly nonlinear combination of 
individual feature-by-feature distances and inter-signature distance, all defined 
by absolute differences. 
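
As a rough illustration of this kind of combination, the sketch below scores each database item by counting how many of its 4 feature distances fall below per-feature thresholds, and then ranks the surviving items by the sum of absolute differences between signatures. All names, threshold values and the data layout are hypothetical; the actual thresholds of SQD are not given here.

```python
def sqd_search(query, database, thresholds, min_score):
    """Two-phase sequential search.

    query and each database item are dicts with keys
    'features' (4 floats) and 'signature' (list of floats).
    """
    ranked = []
    for name, item in database.items():
        # Phase 1: one point for each feature distance below its threshold.
        score = sum(
            1
            for q, d, t in zip(query["features"], item["features"], thresholds)
            if abs(q - d) < t
        )
        if score > min_score:
            # Phase 2: absolute-difference distance between signatures.
            dist = sum(
                abs(a - b) for a, b in zip(query["signature"], item["signature"])
            )
            ranked.append((dist, name))
    return [name for _, name in sorted(ranked)]

# Tiny made-up database with 3-sample signatures.
database = {
    "a": {"features": [1.0, 1.0, 1.0, 1.0], "signature": [1.0, 0.5, 0.5]},
    "b": {"features": [1.1, 1.0, 0.9, 1.0], "signature": [1.0, 0.9, 0.8]},
    "c": {"features": [5.0, 5.0, 5.0, 5.0], "signature": [1.0, 1.0, 0.8]},
}
query = {"features": [1.0, 1.0, 1.0, 1.0], "signature": [1.0, 1.0, 0.8]}
answer = sqd_search(query, database, thresholds=[0.5] * 4, min_score=2)
```

Note that item "c" is filtered out in the first phase even though its signature matches the query exactly, which shows how strongly the thresholds shape the answer set.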

Here is how searching works. When the user submits a query shape, the
system goes through two steps. First, it evaluates 4 distances for each item in the
database, one for each feature. These distances are compared against predefined
thresholds, and each time a distance falls below the threshold the candidate
database item scores a point. At the end of the first step, each database item
therefore has a score between 0 and 4. The second step is restricted to items that have
scored over another ‘score threshold.’ Here, the system computes the distance
between the item’s and the query’s signature, defined as

D(y, z) = \sum_{k=0}^{N-1} |y(k) - z(k)|, \qquad (6)

which is the final ranking factor for the presentation of the answer set.

A remark about this searching scheme is in order. The first step involves a
sequential scan of the whole database, which can result in long searching times
if the number of items is substantial. However, in principle it would be possible
to implement this phase in a smarter and more efficient way utilizing a spatial
data structure, especially considering that feature space dimensionality is low
by design (only 4 features).

3 The Experiments

The systems under examination have been tested by means of a series of
experiments aimed mainly at assessing effectiveness, efficiency and usability. All
the tests have been performed on a heterogeneous database of about 400 images
that includes tools, animals and pasta. This database consists of the Scaroff
database plus a few additions. The results are averaged over 20 queries
for each system.

Effectiveness, that is, precision of the retrieval, has been measured by the
quantity known as Normalized Recall (NR), defined as follows. Consider a
database D of |D| objects where the number of objects relevant to the query is N_r < |D|.
Suppose that the objects are sorted so that the most relevant object is X_1
and the least relevant is X_{N_r}. Let A be the ordered answer set returned by a
query, and let r_i be the rank of the i-th most relevant object X_i in A. The ideal rank
(IR) of the query is then defined as

IR = \frac{1}{N_r} \sum_{i=1}^{N_r} i. \qquad (7)

Note that the ideal rank does not depend on A. The average rank (AR) of A is
then

AR = \frac{1}{N_r} \sum_{i=1}^{N_r} r_i. \qquad (8)

The difference AR - IR gives a measure of the precision achieved by the query.
This quantity is usually normalized in order to obtain a value between 0 and 1,
the Normalized Recall:

NR = 1 - \frac{AR - IR}{|D| - N_r}. \qquad (9)



As Fig. 7 shows, FIC and HER have very similar performance, while SQD does
not achieve the best results.
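
For reference, the Normalized Recall measure can be computed as in the sketch below. It assumes the standard definition, in which the ideal rank and the average rank are means over the relevant objects and their difference is divided by the number of non-relevant objects; the function name and the sample ranks are made up.

```python
def normalized_recall(ranks, db_size):
    """Normalized Recall of one query.

    ranks: 1-based positions in the answer set of the relevant objects.
    db_size: total number of objects in the database.
    """
    n_r = len(ranks)
    ideal_rank = sum(range(1, n_r + 1)) / n_r    # relevant objects ranked first
    average_rank = sum(ranks) / n_r
    return 1.0 - (average_rank - ideal_rank) / (db_size - n_r)

# Hypothetical queries on a 10-object database with 3 relevant objects.
best = normalized_recall([1, 2, 3], 10)    # all relevant objects on top
worst = normalized_recall([8, 9, 10], 10)  # all relevant objects at the bottom
mixed = normalized_recall([1, 3, 5], 10)
```

The value is 1 when all relevant objects come first and 0 when they all come last, so averaging it over several queries gives a single figure per system.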




Fig. 7. Normalized Recall as a function of the number of coefficients



Invariance to various kinds of image transformations is also an issue that
influences the quality of retrieval. These systems all include some method for ensuring
at least robustness, if not exact invariance, to image rotations, reflections and
scaling. Indeed, the representations used by FIC, HER and SQD are intrinsically
invariant to rotations and nearly invariant to scaling, if we neglect the
discreteness of pixels as opposed to continuous Cartesian space.

The next illustration (Fig. 8) shows how FIC, HER and SQD respond to the
same sample query. All the results have the query image in the upper left, while
the answer set, sorted best match first, is shown on the right. In order to make
the comparison as fair as possible, in this case FIC and HER have been set up to
use 4 real numbers for indexing as SQD does. These results are fairly representative
of the standard behavior exhibited by these systems in that SQD returns
a significantly higher number of false alarms. The rabbits returned by FIC and
SQD can be justified by noting how similar these rabbit pictures are to a fish: the
snout and ears can be mistaken for a fish tail, while the fore legs bear a definite
resemblance to a ventral fin.

The efficiency of the systems has been assessed by considering index size
and response time. The three systems have indices of similar size, at least for
the parameter values that have been considered. Both FIC and HER allow the
user to adjust the dimension of feature space (and consequently the size of the
index) by varying the cutoff frequency (FIC) or the number of maxima (HER).
On the other hand, SQD has its dimension fixed to 4. Predictably, the indices of
FIC and HER are slightly larger.



Fig. 8. Sample queries: FIC (a); HER (b) and SQD (c)



As for the response time, arguably more important, SQD’s is perceivably
longer because of the lack of a spatial access method in the search phase. In fact,
FIC’s and HER’s searching times are about an order of magnitude shorter than
SQD’s. This result should not be a surprise: for these moderate feature space
dimensions (1 to 6) the asymptotic performance of FIC and HER is basically
logarithmic in the size of the database, while SQD’s is linear.

Finally, a few words about usability. With FIC and HER, the user can see
what the index data ‘looks like.’ In other words, it is possible to evaluate the
quality of the index data a priori by looking at the contour reconstructed by
means of the very same data that will be used to perform the actual search.
This is shown in Figs. 2 and 3. On the other hand, SQD reconstructs the contour
based on the whole ‘signature,’ but the search, at least in the first phase, is
performed by utilizing only 4 numerical features extracted from the signature.
It is not possible to attempt any contour reconstruction from these 4 numbers,
so it is harder for the user to predict the quality of the results in a reliable way.

4 Conclusion 

This paper has described three systems for image indexing and retrieval based
on contour analysis: the F-Index for Contours (FIC), the Hierarchical Entropy-based
Representation (HER) and the Sketch-based Query by Dialog (SQD).
The three systems have been selected on the basis of their similar concept, aim
and computational complexity. The F-Index for Contours was initially designed
for time series matching and modified specifically for this paper.

Experiments were made with the intent of assessing the effectiveness (retrieval
precision), efficiency and usability of the systems. The experimental results
show that FIC and HER have similar performance, with a Normalized Recall
well above 0.9 if enough coefficients are extracted from the data. SQD, on the
other hand, has less precise retrieval but emphasizes and supports user intervention
to iterate the search until the answer set ‘converges’ to the desired results.



Abstract. This paper clarifies a sufficient condition for the reconstruction
of an object from its shadows. The objects considered are finite
closed convex regions in three-dimensional Euclidean space. First we 
show a negative result that a series of shadows measured using a camera 
moving along a circle on a plane is insufficient for the full reconstruction 
of an object even if the object is convex. Then, we show a positive result 
that a series of pairs of shadows measured using a general stereo system 
with some geometrical assumptions is sufficient for full reconstruction of 
a convex object. 



1 Introduction 

The reconstruction of three-dimensional shapes from measured data such as
range data, photometric information, and stereo image pairs, is called “Shape
from X.” In this paper, we deal with “Shape from shadows.” This problem is also
called “Shape from contour” and “Shape from profile” in computer vision
and “Shape from plane probing” in computational geometry. In computer
vision, theoretical analysis of reconstruction algorithms has received little attention.

This paper proves that a series of shadows measured using a camera moving 
along a circle on a plane is insufficient for the full reconstruction of the visible 
hull of an object. Although this type of measuring system is sometimes used in 
computer vision, our results show that we cannot reconstruct full profiles of ob- 
jects using such a camera system even if the objects are convex. Next, we prove 
a positive result that a series of pairs of shadows measured using a general stereo 
system with some simple geometric assumptions fully reconstructs convex objects.
Then, using the same mathematical ideas as for “Shape from shadows,” we
prove a similar sufficient condition for “Shape from range data” and “Shape from
photometric data.” We also clarify the relation between “Shape from shadows”
and image reconstruction from line integrals using the characteristic functions 
of line integrals. We prove similar properties between the two problems for the 
orbits of source points. The orbits are spatial curves on which the eye center of 
the camera system and the x-ray source move for “Shape from shadows” and 
image reconstruction from line integrals, respectively. 

The illumination problem estimates the minimum and maximum numbers
of view points for the reconstruction of a convex body from its views from an
appropriate set of view points. The illumination problem is equivalent to shape
reconstruction from silhouettes or shadows. However, it is in general difficult to
determine the configuration of view points for a given object. There are many results
on the reconstruction of a convex polygon from its shadows. Laurentini
was concerned with geometric properties of
silhouette-based shape reconstruction for polyhedrons, and clarified the relation
between the visible hull and the convex hull of a polyhedron.

Tuy proved that for a positive function defined in a finite closed convex
region in three-dimensional Euclidean space, it is possible to reconstruct the
function from line integrals measured by cone beams, if the source of the line integrals
moves on a pair of circles with the same radius lying on mutually perpendicular
planes which encircle the region. Shape from perspective projections is related
to the cone-beam reconstruction problem since it is possible to determine
the boundary from line integrals.

2 Shape Reconstruction from Support Planes 

In this section, we summarize results of convex geometry in three-dimensional
Euclidean space R^3, since the analytical relations between a
convex object and its tangent planes have sometimes been dealt with in a well-described
form for computer vision by several authors. Let x-y-z be an orthogonal
coordinate system in R^3. We call the system the world coordinate system.
We denote a vector in the world coordinate system as x = (x, y, z)^T, where T means the
transpose of a vector. Setting x^T y to be the inner product of x and y, we define
the length of a vector as |x| = \sqrt{x^T x}. Thus, |x - y| is the Euclidean distance
between x and y. Furthermore, setting 0 \le \theta \le \pi and 0 \le \phi < 2\pi, we define a
rotation matrix

R = \begin{pmatrix}
\cos\phi \sin\theta & \sin\theta \sin\phi & \cos\theta \\
\cos\phi \cos\theta & \sin\phi \cos\theta & -\sin\theta \\
-\sin\phi & \cos\phi & 0
\end{pmatrix}. \qquad (1)

Next, for the basis of the world coordinate system e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T,
and e_3 = (0, 0, 1)^T, we define a set of orthogonal basis vectors

\bar{e}_i = R^T e_i, \quad i = 1, 2, 3. \qquad (2)
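
The construction can be checked numerically with the short sketch below, which builds the matrix row by row and verifies that the rows, which are exactly the rotated basis vectors obtained by applying the transpose of R to the world basis, are orthonormal. The function names and the sample angles are arbitrary choices made for this illustration.

```python
import math

def rotation(theta, phi):
    """Rotation matrix of eq. (1), returned as a list of three rows."""
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    return [
        [cp * st, st * sp, ct],
        [cp * ct, sp * ct, -st],
        [-sp, cp, 0.0],
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# R^T e_i is the i-th column of R^T, that is, the i-th row of R.
basis = rotation(0.7, 2.1)
gram = [[dot(basis[i], basis[j]) for j in range(3)] for i in range(3)]

# At theta = pi/2, phi = 0 the first basis vector points along the x axis.
first = rotation(math.pi / 2, 0.0)[0]
```

The first row is the unit vector with polar angle theta and longitude phi, which is the vector later used as the normal of the support plane.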

Thus, we obtain the relation R^T = (\bar{e}_1, \bar{e}_2, \bar{e}_3). Setting D to be the vector
gradient operator on the unit sphere,

D = \left( \frac{\partial}{\partial\theta}, \frac{1}{\sin\theta} \frac{\partial}{\partial\phi} \right), \qquad (3)

where \theta and \phi are the polar angle and the longitude on the sphere, the relation
D\bar{e}_1 = (\bar{e}_2, \bar{e}_3) holds. This equation leads to the equation R^T = (\bar{e}_1, D\bar{e}_1).
Setting K to be a bounded closed convex set in R^3, we denote the boundary of
K as \partial K. If a plane touches K at a point on \partial K, this plane is called a support
plane of K. We set h(\theta, \phi) to be the Euclidean distance between the origin of
the world coordinate system and the support plane of K, the normal vector of
which is \bar{e}_1. h(\theta, \phi) is a function on the unit sphere. K lies in the half space

\Pi(\theta, \phi) = \{ x \mid x^T \bar{e}_1 \le h(\theta, \phi) \}. \qquad (4)

Therefore, we call the plane

x^T \bar{e}_1 = h(\theta, \phi) \qquad (5)

a support plane of K.
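
For a convex polytope the support function is simply the largest inner product between the unit normal and a vertex. The sketch below illustrates this on a cube; the function name `support` and the choice of shape are assumptions made for the example.

```python
import math

def support(points, normal):
    """Support function of the convex hull of points, for a unit normal."""
    return max(sum(p * n for p, n in zip(pt, normal)) for pt in points)

# Vertices of the cube [-1, 1]^3, centered at the origin.
cube = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]

h_face = support(cube, (1.0, 0.0, 0.0))    # normal to a face
s = 1.0 / math.sqrt(3.0)
h_corner = support(cube, (s, s, s))        # normal pointing at a corner

# Every vertex lies in the half space bounded by the support plane.
inside = all(
    sum(p * n for p, n in zip(pt, (s, s, s))) <= h_corner + 1e-12
    for pt in cube
)
```

The distance to the support plane is 1 along a face normal and sqrt(3) along a diagonal, and the whole cube stays on one side of each plane, as the half-space relation requires.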

Furthermore, let \nabla be the scalar gradient operator on the unit sphere S^2;
that is, for a function f(\theta, \phi) defined on the unit sphere,

\nabla f = \left( \frac{\partial f}{\partial\theta}, \frac{1}{\sin\theta} \frac{\partial f}{\partial\phi} \right)^T. \qquad (6)

The following proposition is a well-known result in convex geometry [12].

Proposition 1 Let \mathbf{h} = (h, \nabla h^T)^T. Then, x \in \partial K is obtained by

x = R^T \mathbf{h}. \qquad (7)

Equation (7) is called the support plane expression of a convex body. From eq. (7),
if the normal vectors of support planes are defined, we can obtain the support
plane expression of an object.

Several authors in the computer vision and image processing field rediscovered
eq. (7) and its two-dimensional version. However, the support function
is a fundamental result in convex geometry [12]. Figure 1 (a) shows the
relation between a convex object and the support plane.




Fig. 1. The configurations of a convex object and its support plane (a) and the three
orthogonal vectors (b).



3 Shape Reconstruction from Perspective Projections



For a point a = (a, b, c)^T, setting -a to be the positive direction of the \zeta axis, and
a to be the origin of a coordinate system, we define a right-handed orthogonal
coordinate system \xi-\eta-\zeta. We call the system the observation system and
denote a vector in this coordinate system as \vartheta = (\xi, \eta, \zeta)^T. Thus, the world
coordinate system and the observation coordinate system are related by

x = U\vartheta - a \qquad (8)

for an appropriate rotation matrix U, where U depends on the directions of
the \xi and \eta axes. Setting a to be the camera center and the plane \zeta = f, such that
|f| < |a| and f/|a| > 0, to be an imaging plane, we denote the shadow of K
on the imaging plane as K(a) and denote the boundary of the shadow as \partial K(a).
Furthermore, we denote the boundary curve \partial K(a) as

r(t, a) = (\xi(t, a), \eta(t, a), f)^T, \quad 0 \le t \le T(a), \qquad (9)

such that r(t, a) = r(t + T(a), a), where T(a) is the total length of \partial K(a), since
\partial K(a) is a closed curve on the imaging plane. The vector

l(t, a) = \frac{U r(t, a)}{|r(t, a)|} \qquad (10)

is the N-vector of r(t, a) in the world coordinate system [13]. l(t, a) moves on
a closed curve \partial L(a) on the unit sphere for each a. We call this closed curve
\partial L(a) the N-curve of r(t, a). Next, setting

\dot{l}(t, a) = \frac{\partial}{\partial t} l(t, a), \quad m(t, a) = \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|}, \qquad (11)

we obtain the relation

l(t, a)^T m(t, a) = 0. \qquad (12)

Furthermore, m(t, a) is the normalized tangent vector of \partial K(a). From eq. (12),
setting

n(t, a) = l(t, a) \times m(t, a) = l(t, a) \times \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|}, \qquad (13)

we obtain a moving orthogonal basis \{ l(t, a), m(t, a), n(t, a) \}. Figure 1 (b) shows
the relations of these three mutually orthogonal vectors. Moreover, n(t, a) is the
normal vector of a plane which touches K and passes through point a. Thus, a
support plane of K is given by

P(a) = \{ x \mid x^T n(t, a) = a^T n(t, a) \}. \qquad (14)

From eq. (14), setting

\bar{e}_1 = l(t, a) \times \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|}, \qquad (15)

we can obtain R and h(\theta, \phi), using the relations

R^T = \left( l(t, a) \times \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|}, \; D \left( l(t, a) \times \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|} \right) \right) \qquad (16)

and

h(\theta, \phi) = a^T \left( l(t, a) \times \frac{\dot{l}(t, a)}{|\dot{l}(t, a)|} \right). \qquad (17)

Thus, eqs. (16) and (17) imply that we can reconstruct a convex body from its
shadows, if we obtain n(t, a), which is defined by eqs. (11) and (13). These
equations yield the following theorem.



Theorem 1 If vector n(t, a) is measured all over the unit sphere, it is possible 
to obtain full reconstruction of a convex body from its shadows. 



Theorem 1 is a modification of proposition 1, which is a classical result of convex
geometry. It is, however, important from the standpoint of computer vision
because the theorem shows a sufficient condition for shape reconstruction from
shadows obtained by perspective projections. Also, from eqs. (16) and (17), we
can reconstruct a convex object using only boundary information of shadows.
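As a numerical illustration of the moving basis and of the support value $a^\top n(t,a)$, the following minimal Python sketch builds the N-curve of the shadow boundary of a sphere and checks that $a^\top n$ equals the sphere radius. The function name `moving_frame`, the finite-difference approximation of the derivative, and the test scene (a sphere of radius `rho` at the origin seen from `a`) are assumptions made for this illustration; they are not taken from the original text.

```python
import numpy as np

def moving_frame(l):
    """Return (m, n) for unit N-vectors l of shape (T, 3) sampled on a closed N-curve.

    m approximates the normalized t-derivative of l by a periodic central
    difference, and n = l x m is the unit normal of the plane through l and m.
    """
    dl = np.roll(l, -1, axis=0) - np.roll(l, 1, axis=0)
    m = dl / np.linalg.norm(dl, axis=1, keepdims=True)
    n = np.cross(l, m)
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    return m, n

# Hypothetical test scene: sphere of radius rho at the origin, viewpoint a = (0, 0, -D).
rho, D = 1.0, 3.0
a = np.array([0.0, 0.0, -D])
beta = np.arcsin(rho / D)                       # half-angle of the view-cone
t = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
axis = np.array([0.0, 0.0, 1.0])                # direction from a towards the sphere
# The ring is traversed clockwise (seen from +z) so that n points away from the sphere centre.
ring = np.stack([np.cos(t), -np.sin(t), np.zeros_like(t)], axis=1)
l = np.cos(beta) * axis + np.sin(beta) * ring   # N-curve of the shadow boundary

m, n = moving_frame(l)
h = n @ a                                       # support value a^T n along the boundary
```

For this scene every support plane touches the sphere, so `h` is constant along the curve. Reversing the traversal direction of the N-curve flips the sign of `n` and therefore of `h`.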



4 A Sufficient Condition for Reconstruction 

4.1 Shape from Shadows 

For a point $x \in K(a)$, setting $x(a)$ to be the N-vector of $x$, we define a cone

$$C(a) = \{x \mid x = \lambda x(a), \ \lambda > 0\}. \qquad (18)$$

We call $C(a)$ the view-cone at $a$. The boundary of $C(a)$ is

$$\partial C(a) = \{x \mid x = \lambda l(t,a), \ l(t,a) \in \partial L(a), \ \lambda > 0\}. \qquad (19)$$

If a pair of view-cones which have the same vertex satisfy the relation

$$C_1(a) \subset C_2(a), \qquad (20)$$

we write $\partial L_1(a) \prec \partial L_2(a)$, where $\partial L_i(a)$ is called the associated N-curve of a
view-cone $C_i(a)$.

If $l(t,a)$ moves on $\partial L(a)$, $n(t,a)$ moves on the unit sphere and forms a
closed curve $\partial N(a)$, which we call the orthogonal N-curve. From geometrical
consideration, it is obvious that if $\partial L_1(a) \prec \partial L_2(a)$, then $\partial N_1(a) \succ \partial N_2(a)$.
Furthermore, setting

$$\partial C(a)^\perp = \{x \mid x = \lambda n(t,a), \ \lambda > 0\}, \qquad (21)$$

if eq. (20) holds, then the relation $C_1(a)^\perp \supset C_2(a)^\perp$ holds. Setting $C_1(a)$ and
$C_2(a)$ to be the view-cones of $K$ and of $B$, respectively, where $B$ is a sphere which encircles $K$,
we obtain the following theorem, where the origin of the world coordinate system
is at the center of $B$.
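Theorem 2 For a bounded closed set $A$ in $\mathbf{R}^3$, if

$$\bigcup_{a \in A} C_2(a)^\perp \supseteq \mathbf{R}^3, \qquad (22)$$

then we can reconstruct $K$ from shadows which are obtained by perspective projection.

(Proof) $\forall a \in A \subset \mathbf{R}^3$, the relation $C_1(a)^\perp \subseteq \mathbf{R}^3$ holds. If $\bigcup_{a \in A} C_2(a)^\perp \supseteq \mathbf{R}^3$,
then we obtain

$$\mathbf{R}^3 \subseteq \bigcup_{a \in A} C_2(a)^\perp \subseteq \bigcup_{a \in A} C_1(a)^\perp \subseteq \mathbf{R}^3. \qquad (23)$$

This relation concludes the relation $\bigcup_{a \in A} C_1(a)^\perp = \mathbf{R}^3$. Furthermore, this
equation leads to $\bigcup_{a \in A} \{n_1(t,a)\} = S^2$. (Q.E.D.)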

In the following, we show some examples of a bounded closed set A. 

Example 1 Let $P_1$ and $P_2$ be a pair of perpendicular planes which pass through
the center of $B$. Setting $a_1$ and $a_2$ to be circles on $P_1$ and $P_2$, the centers of
which are at the center of $B$, with radii $a$ and $b$, respectively, if

$$a^{-2} + b^{-2} \le r^{-2}, \qquad (24)$$

where $r$ is the radius of $B$, then $a_1 \cup a_2$ is an example of $A$.



Example 2 Let $P_1$ and $P_2$ be a pair of parallel planes which touch $B$. Setting
$a_1$ and $a_2$ to be circles with radius $d$, the centers of which are on $B$, if $d > r$,
$a_1 \cup a_2$ is an example of $A$.

In Figures 2 (a) and 2 (b), we show the spatial configurations of $K$ (which
is enclosed in $B$), $B$, and $A$ for examples 1 and 2, respectively. It is clear from
the figures that the configurations of examples 1 and 2 satisfy the condition of
theorem 2. These examples show that, for the reconstruction of a convex object
from its shadows, the orbit on which the camera center of perspective projection
moves is a one-dimensional manifold, or a collection of curves in $\mathbf{R}^3$.
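The covering condition of theorem 2 can be probed numerically for the two camera orbits. The sketch below rests on an assumption that is our reading of the theorem, not a statement from the paper: for a sphere $B$ of radius `r` centered at the origin, a unit normal $n$ is reached from a viewpoint $a$ exactly when $a^\top n = r$, so an orbit is sufficient when every unit normal is reached from some point of it. All function names are hypothetical, and the test only samples random normals, so it is a sanity check rather than a proof.

```python
import numpy as np

def circle_hits(normals, center, u, v, radius, r):
    """For each unit normal n, test whether some point
    a = center + radius * (cos(t) * u + sin(t) * v) satisfies a . n = r."""
    base = normals @ center
    amplitude = radius * np.hypot(normals @ u, normals @ v)
    return np.abs(r - base) <= amplitude + 1e-12

def orbit_covers_sphere(circles, r, samples=20000, seed=0):
    """Check on random unit normals that each one is reached from some circle."""
    rng = np.random.default_rng(seed)
    normals = rng.normal(size=(samples, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    covered = np.zeros(samples, dtype=bool)
    for center, u, v, radius in circles:
        covered |= circle_hits(normals, center, u, v, radius, r)
    return bool(covered.all())

ex, ey, ez = np.eye(3)
origin = np.zeros(3)

def example1(a, b):
    # Two circles on perpendicular planes through the centre of B.
    return [(origin, ex, ey, a), (origin, ex, ez, b)]

def example2(d, r):
    # Two circles on parallel planes touching B, centred at the tangent points.
    return [(r * ez, ex, ey, d), (-r * ez, ex, ey, d)]
```

With `r = 1`, the orbit `example1(1.5, 1.5)` reaches all sampled normals while `example1(1.2, 1.2)` does not; likewise `example2(1.1, 1.0)` passes and `example2(0.8, 1.0)` fails.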

The camera orbit configuration of example 2 has the significant property that
if the regions of interest of two cameras overlap for each $a$, we can apply the
method of shape from stereo pairs. In this case, the stereo images are measured
using a general stereo system, the two optical axes of which are not parallel. The
most interesting camera configuration is that in which the epipolar line of the
stereo system is perpendicular to each circle on which the camera moves.

4.2 Shape from Range Data and Photometric Data 

The main idea used in the analysis of the problem is that shadows of convex
objects are functions on the unit sphere. This property yields a sufficient condition
for the reconstruction of objects from the boundaries of their shadows. The
range data and the photometric data of a convex object are also functions on the
unit sphere. Using this property we also derive a sufficient geometric condition
for full recovery of convex objects from range data and photometric data. Here,
we assume that the range data and photometric data are also obtained using
perspective projection geometry.

Fig. 2. The configuration of the camera motions of the examples.

Setting $d(x,a)$ to be range data or photometric data at $x \in K(a)$, for a fixed
$a$, $d(x,a)$ is a function of $x(a)^\perp \in C(a)^\perp$. This geometric property implies that
for a bounded closed set $A$ in $\mathbf{R}^3$, if

$$\bigcup_{a \in A} x(a)^\perp = S^2, \qquad (25)$$

then we can reconstruct objects from the pair $(x(a)^\perp, d(x,a))$. Since eq. (25)
is equivalent to

$$\bigcup_{a \in A} C(a)^\perp = \mathbf{R}^3, \qquad (26)$$

we obtain the following theorem as in the case of shape reconstruction from
shadows.

Theorem 3 For a bounded closed set $A$ which satisfies eq. (26), if $d(x,a)$ is a
function on $S^2$, we can reconstruct $\partial K$.

Theorem 3 is valid for non-convex objects if $d(x,a)$ is a function on $S^2$. Thus
we obtain the following theorem.

Theorem 4 For a bounded closed set $A$ in $\mathbf{R}^3$, we can reconstruct $\partial K$ from
$d(x,a)$, when $C(a)$ is the view-cone of a sphere $B$ which encloses $K$.

This theorem also leads to the same configurations of camera orbits as examples
1 and 2 for range data and photometric data obtained using perspective
projection geometry.
5 Tomography and Voting Method

Setting $f(x)$ to be a positive, integrable and square-integrable function
defined in $K$, for $a \in \mathbf{R}^3$ and $\omega \in S^2$,

$$g(a, \omega) = \int_0^\infty f(a + t\omega)\, dt \qquad (27)$$

is the divergent x-ray transform. Reconstruction of $f(x)$ from $g(a, \omega)$ is the
mathematical model for the reconstruction of volume distributions from cone
beams in x-ray computerized tomography. If $a$ moves on a beam-source
orbit which is the same as the view-point orbit defined in example 1 for $a = b$,
it is possible to reconstruct $f(x)$ fully from $g(a, \omega)$. Therefore, from $g(a, \omega)$, we
can reconstruct $\partial K$ under the same condition as in example 1.

On the other hand, our data are shadows of $f(x)$ measured by perspective
projections. Therefore, denoting the characteristic function of $g(a, \omega)$ and the
ray cone as

$$\chi(a, \omega) = \ldots \qquad (28)$$

and

$$C(a, \omega) = \{(a, \omega) \mid \chi(a, \omega) = 1, \ a \in \mathbf{R}^3, \ \omega \in S^2\}, \qquad (29)$$

respectively, we obtain the following relations

$$\ldots \qquad (30)$$

$$\ldots \qquad (31)$$

where $\partial C(a, \omega)$ is the boundary of $C(a, \omega)$.

The support plane method for shape reconstruction is an algebraic expression
of the second equation of eq. (31). These geometric properties show a mathematical
relationship between “Shape from shadows” and the image reconstruction
from projections in x-ray computerized tomography, since “Shape from shadows”
focuses on shape reconstruction.

Setting the characteristic function in the view cone to be

$$c(x; a, \omega) = \ldots,$$

if we vote $c(x; a, \omega)$ into the space, we have a function

$$k(x) = \sum_{a \in A} c(x; a, \omega) \qquad (32)$$

as the result of voting. For a positive integer $\tau$, a set of points

$$K_\tau = \{x \mid k(x) \ge \tau\} \qquad (33)$$

defines an object. The construction of shape by $K_\tau$ is called shape reconstruction
by voting.
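The voting rule can be illustrated with a small two-dimensional sketch. It assumes, for simplicity, that the object is a disc of radius `rho` centered at the origin, so that the shadow seen from each viewpoint is known in closed form; the function name `vote`, the planar setting, and the choice of twelve viewpoints on a circle are hypothetical choices for this illustration only.

```python
import numpy as np

def vote(points, viewpoints, rho):
    """Count, for each 2-D point, how many view-cones of a disc of radius rho
    (centred at the origin) contain it; this plays the role of k(x)."""
    points = np.asarray(points, dtype=float)
    counts = np.zeros(len(points), dtype=int)
    for a in np.asarray(viewpoints, dtype=float):
        dist = np.linalg.norm(a)
        cos_beta = np.sqrt(1.0 - (rho / dist) ** 2)    # cosine of the cone half-angle
        rays = points - a
        cos_angle = (rays @ (-a)) / (np.linalg.norm(rays, axis=1) * dist)
        counts += cos_angle >= cos_beta - 1e-12         # vote 1 if the point is in the cone
    return counts

# Twelve viewpoints on a circle of radius 4 around a unit disc.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
viewpoints = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

k = vote([(0.5, 0.0), (1.5, 0.0)], viewpoints, rho=1.0)
tau = len(viewpoints)
inside = k >= tau          # membership in K_tau with tau equal to the number of views
```

Choosing `tau` equal to the number of views keeps only the points seen inside every shadow; a smaller `tau` tolerates views in which a point was missed.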
If an object is a collection of points in a space, the projection $K(a)$ is a collection
of points on a plane for each view point $a$. Furthermore, if an object is a collection
of line segments in a space, the projection $K(a)$ is a collection of line segments
on a plane for each view point $a$. A point and a line segment on an imaging
plane determine a half line and a fan whose origins are vector $a$, respectively.
We can reconstruct the position of a point and a line segment in a space as
the common sets of many half lines and many fans, respectively, which are
measured from various directions. A point and a line segment are convex objects
whose dimensions are zero and one, respectively. A polyhedron is a collection of
vertices (points) and edges (line segments) on a closed surface. These geometric
properties imply that the voting method permits us to reconstruct a class of
nonconvex polyhedrons without holes from a series of images, if each vertex and
each edge of a polyhedron are measured in several images. In Figures 3 (a)
and (b), we show an image of a nonconvex polyhedron, which is a collection of
points on a plane, and the polyhedron reconstructed by voting of half lines in a
space, respectively.





Fig. 3. An example of the reconstruction of a nonconvex polyhedron by voting of half 
lines in a space. 



6 Conclusions 

Our results have mathematically clarified that, for full recovery of convex objects,
two cameras, each of which moves on a circle, are sufficient. We also clarified the
relation between shape reconstruction from shadows and shape reconstruction
by voting, which is used for model generation for mixed reality.

We also showed that the voting method permits us to reconstruct a class of
nonconvex polyhedrons from a series of images using projections of vertices and
edges, which determine half lines and fans, respectively.
Abstract. In this report, we present a 3D shape modeling method that uses
the shape’s silhouettes from multiple views to determine the model
(polyhedron) parameters. The polyhedron parameters are determined
by neural networks, each of which represents the model’s silhouette observed
from a view point, and determines the polyhedron parameters by
the back propagation algorithm so that the model’s silhouette from each
view approximates the corresponding silhouette of the target shape. By
conducting basic experiments, we verified the effectiveness of the method.



1 Introduction 

In many cases, a 3D shape model is manually designed by specifying parameters
of a polyhedron so that the polyhedron provides a good approximation of the
target shape. However, in order to reduce the amount of time and labor required
for designing, interactive design tools with user-friendly graphical interfaces
and automated systems with range measuring devices have been developed.

To save measurement cost, instead of directly measured 3D data, multiple
2D views have been used for disparity-based 3D reconstruction, or for generating
new views by image-based rendering methods without a 3D model [11].

In this paper, as part of an attempt to simplify measurement and automate
3D modeling, we present a method to construct a 3D shape model for a target
shape by deforming the 3D shape model so that each of its silhouettes from multiple
viewpoints fits the corresponding silhouette of the target shape. Although
some shapes with concave surfaces cannot be modeled due to the limitation of
the silhouette, accurate shape modeling is expected for many practical shapes.

2 Analytical Silhouette Representation by a Neural 
Network 

A number of deformable shape models have been developed Q |2| Q |S| 1101 • 

Among them, the shape representation neural network|2| has the following ben- 
efits. 
(i) It provides a fully analytical representation, and partial derivatives with respect
to model parameters are computed analytically for gradient descent
algorithms to update model parameters iteratively.

(ii) Represented shapes are blurred just by changing gain parameters of sigmoid 
functions. Coarse-to-fine matching is easily implemented by changing gain 
parameters during matching process. 

(iii) Shape approximation ability of multi-layer network of nonlinear units is 
higher than that of linear combination of base functions. 

(iv) A number of powerful learning algorithms for multi-layer neural networks 
such as back propagation can be used for shape deformation. 

In this paper, we use a polyhedron to represent a 3D shape. The positions of 
vertices are parameters to specify the polyhedron. Each face of the polyhedron 
is a triangle and its projection onto an observation plane is also a triangle. Ac- 
cording to the shape representation neural network, a silhouette of a polyhedron 
is represented as follows. 

Each triangular face of the polyhedron is projected onto the observation
plane. This projected face is described by taking an AND operation over the
half plane regions with borders corresponding to the three edges of the triangle.
Each half plane is represented by a neuron as shown in Fig. 2. The neuron in
this figure outputs 1 for inputs $(x, y)$ inside the half plane $ax + by + c > 0$
and outputs 0 for inputs $(x, y)$ inside the half plane $ax + by + c < 0$. When
the sigmoid function is used as the neuron’s activation function, the edge of
the half plane is blurred by decreasing $\alpha_{edge}$ in Eq. (2), as the sigmoid function
changes its graph as shown in Fig. 1. The whole silhouette of the polyhedron is
represented by taking an OR operation over the projected triangular faces. As
the AND operation and the OR operation are computed by neuron functions,
this whole computation is executed by a multi-layer neural network.

Fig. 1. Sigmoid functions for various gain values.

Fig. 2. Representation of a half plane by a single neuron.



2.1 Representation of a Half Plane Region 

The function $f(x,y)$ which outputs 1 for inputs $(x,y)$ inside the half plane
$ax+by+c > 0$ and outputs 0 for inputs $(x,y)$ inside the half plane $ax+by+c < 0$
is realized by a single neuron as follows.

$$\mathrm{sigmoid}(s, \alpha) = \frac{1}{1 + e^{-\alpha s}} \qquad (1)$$

$$f(x, y) = \mathrm{sigmoid}(ax + by + c, \alpha_{edge}) \qquad (2)$$



2.2 Representation of a Triangular Region 

A triangular region is obtained as an intersection of three half planes. The intersection
is obtained by an AND operation over three functions $f_i(x,y)$ $(i = 1,2,3)$,
where $f_i(x,y)$ represents the $i$-th half plane. The sigmoid function with a
sufficiently large gain value and an appropriate threshold performs the AND
operation in the following equation.

$$F(x,y) = \mathrm{sigmoid}(f_1(x,y) + f_2(x,y) + f_3(x,y) - \theta, \alpha_{and}) \qquad (3)$$



Thus the triangular region is represented by the neural network shown in Fig. 4,
where the Translation part performs the conversion (Fig. 3) between the three vertices of
the triangle $v_i$ $(i = 1,2,3)$ and the line parameters $\mathbf{a}_i = (a_i, b_i, c_i)^\top$ $(i = 1,2,3)$
by the following equations.

$$\mathbf{a}_i = \ldots \qquad (4)$$

$$\ldots \qquad (5)$$
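The half-plane neuron and the AND neuron can be sketched in a few lines of Python. Because the conversion from vertices to line parameters is not legible in the source, the helper `edge_parameters` below is a hypothetical stand-in: it builds a unit-normal line through two vertices and orients it so that the third vertex lies on the positive side. The gain values and the threshold 2.5 are likewise illustrative assumptions.

```python
import numpy as np

def sigmoid(s, gain):
    """Sigmoid with gain, as in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-gain * s))

def edge_parameters(p, q, inner):
    """Line (a, b, c) through p and q with unit normal, signed so that
    a*x + b*y + c > 0 on the side that contains the point `inner`."""
    a, b = -(q[1] - p[1]), q[0] - p[0]
    norm = np.hypot(a, b)
    a, b = a / norm, b / norm
    c = -(a * p[0] + b * p[1])
    if a * inner[0] + b * inner[1] + c < 0:
        a, b, c = -a, -b, -c
    return a, b, c

def triangle_output(x, y, vertices, gain_edge=10.0, gain_and=20.0, threshold=2.5):
    """Soft indicator of a triangle: AND of three half-plane neurons, as in Eq. (3)."""
    v1, v2, v3 = vertices
    edges = [edge_parameters(v1, v2, v3),
             edge_parameters(v2, v3, v1),
             edge_parameters(v3, v1, v2)]
    total = sum(sigmoid(a * x + b * y + c, gain_edge) for a, b, c in edges)
    return sigmoid(total - threshold, gain_and)

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
inside_value = triangle_output(1.0, 1.0, tri)
outside_value = triangle_output(5.0, 5.0, tri)
```

A threshold between 2 and 3 makes the output high only when all three half-plane neurons fire; lowering `gain_edge` blurs the triangle boundary, which is the mechanism used for coarse-to-fine matching.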



2.3 Representation of Perspective Projection of a Triangular 
Region 

Each triangular face of the polyhedron in the 3D space is projected onto an
observation plane as shown in Fig. 5. According to this projection, vertices
$V_i$ $(i = 1, 2, 3)$ in the 3D space are transformed to vertices $v_i$ $(i = 1, 2, 3)$
on the observation plane. This transformation is dependent on the camera
position, the view angles and the focus distance. These camera parameters are
denoted by $\Phi = (\phi, \theta, \psi)^\top$, $C = (C_x, C_y, C_z)^\top$, and $f$. Using these parameters,
the perspective projection is formulated as follows.

$$R_z(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (6)$$

$$R_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix} \qquad (7)$$

$$R_x(\psi) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{pmatrix} \qquad (8)$$

$$R(\Phi) = R_x(\psi) R_y(\theta) R_z(\phi) \qquad (9)$$

$$\ldots \qquad (10)$$

$$\ldots = R(\Phi) V_i + C \qquad (11)$$

$$\ldots \quad (i = 1, 2, 3) \qquad (12)$$

Fig. 6 shows a way of representing a perspective projection of a triangular
region by a neural network. In this figure, the Project Part performs the projection
formulated by Eqs. (6)-(12).

Fig. 3. A triangular region with vertices $v_1, v_2, v_3$.

Fig. 4. Representation of a triangular region by a neural network.

Fig. 5. Perspective projection of a triangular surface.

Fig. 6. Representation of perspective projection of a triangular region by a neural
network.
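A minimal sketch of such a projection in Python is given below. Several details are assumptions rather than facts from the source: the composition order of the three rotations, the convention that the camera looks along the third axis of the transformed coordinates, and the pinhole division by depth are standard choices filled in for illustration, and the function names are hypothetical.

```python
import numpy as np

def rot_z(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def project(V, angles, C, f):
    """Project a 3D vertex V to the observation plane.

    angles = (phi, theta, psi); C is the translation; f is the focus distance.
    Assumed model: V' = R V + C followed by v = f * (X'/Z', Y'/Z').
    """
    phi, theta, psi = angles
    R = rot_x(psi) @ rot_y(theta) @ rot_z(phi)      # assumed composition order
    Vc = R @ np.asarray(V, dtype=float) + np.asarray(C, dtype=float)
    return f * Vc[:2] / Vc[2]

v_plain = project((1.0, 2.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 5.0), 2.0)
v_turned = project((1.0, 0.0, 0.0), (np.pi / 2, 0.0, 0.0), (0.0, 0.0, 5.0), 2.0)
```

Every step is differentiable with respect to the vertex coordinates wherever the depth is nonzero, which is what allows the projection to sit inside a network trained by back propagation.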

2.4 Representation of a Silhouette of a Polyhedron 

The whole projection of a polyhedron, which is observed as a silhouette of the
polyhedron from a viewpoint, is obtained by taking an OR operation over the projected
triangular regions as shown in Fig. 7. This combination of projected triangular
regions is computed by the neural network shown in Fig. 8. In this figure, a
switching operation is introduced instead of the OR operation to reduce the
computation cost. When computing $F(x,y)$, a triangular region far from the point
$(x, y)$ does not affect the result, as the sigmoid function decays quickly as shown in
Fig. 1. For this reason, we can exclude the triangular regions far from $(x, y)$ when
computing $F(x,y)$. The switching operation is used to choose sufficiently near
triangular regions. Due to this switching operation, the function $F(x, y)$ is no
longer analytical. However, as the result obtained by switching provides an approximation
for the analytical case of using the OR operation, the benefit of the shape
representation network still exists. An example of a polyhedron silhouette is illustrated in
Fig. 9. In this example, edges of triangular regions are indicated by thin lines to
make each triangular shape visible.

Fig. 7. Perspective projection of a polyhedron.

3 Shape Reconstruction by Training Neural Networks 

Silhouettes of a polyhedron model are computed for multiple viewpoints as shown
in Fig. 9. These silhouettes of a model are compared with the corresponding
silhouettes of the target shape. The vertices of the polyhedron $V_k$ $(k = 1, 2, \ldots, K)$
are modified for the model to provide a better approximation. This modification
is performed by the gradient descent algorithm as follows.

(i) The energy function to evaluate the difference between the model silhouettes
$F^{(m)}(x,y)$ $(m = 1, 2, \ldots, M)$ and the corresponding target silhouettes
$G^{(m)}(x,y)$ $(m = 1, 2, \ldots, M)$ is defined by

$$e^{(m)}(x,y) = |F^{(m)}(x,y) - G^{(m)}(x,y)| \qquad (13)$$

$$E^{(m)} = \sum_{x,y} e^{(m)}(x,y) \qquad (14)$$

$$E = \sum_{m=1}^{M} E^{(m)} \qquad (15)$$

(ii) Compute the partial derivatives $\partial E / \partial V_k$ by the back propagation algorithm.
As a switching operation is introduced instead of the OR operation, the
backward signal flow occurs as indicated by a mesh region in Fig. 10.

(iii) Update iteratively the vertex positions of the polyhedron model by

$$V_k = V_k - \eta \frac{\partial E}{\partial V_k} \quad (k = 1, 2, \ldots, K) \qquad (16)$$

where $\eta$ is a small positive number.
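The update rule can be tried on a deliberately small problem. The sketch below is not the network of this paper: it fits a single edge position `p` of a one-dimensional blurred step to a target profile, and it swaps the absolute difference for a squared difference so that the gradient shrinks smoothly near the optimum. The function name `fit_edge`, the gain, the step size `eta`, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def sigmoid(s, gain):
    return 1.0 / (1.0 + np.exp(-gain * s))

def fit_edge(target, x, p0, gain=10.0, eta=1e-3, steps=200):
    """Gradient descent on the edge position p of the model F(x) = sigmoid(x - p, gain).

    The energy is E = sum over x of (F - G)^2 (squared error, an assumption),
    and the derivative dF/dp is evaluated analytically.
    """
    p = p0
    for _ in range(steps):
        F = sigmoid(x - p, gain)
        dF_dp = -gain * F * (1.0 - F)
        grad = np.sum(2.0 * (F - target) * dF_dp)
        p -= eta * grad                      # the update p <- p - eta * dE/dp
    return p

x = np.linspace(-1.0, 1.0, 201)
target = sigmoid(x - 0.3, 10.0)              # target profile with its edge at 0.3
p_fit = fit_edge(target, x, p0=-0.2)
```

Because the blurred edge has a nonzero derivative over a band around `p`, the model edge is pulled towards the target edge even when the two do not coincide at the start; a sharper edge (larger gain) narrows that band.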



4 Experiment 

Two artificial target shapes, Fish and Airplane, were used for the experiments.
These shapes were numerically produced using polyhedrons with a sufficiently
large number of vertices. Their silhouettes were computed numerically for various
views. The size of the polyhedron (the number of vertices) used as a shape
model was 62 for Fish and 114 for Airplane. Silhouettes from 5 views were used
in the training for Fish and silhouettes from 3 views were used in the training for
Airplane. The silhouette from View No. 0 was not used for the training and was used
to test the generalization ability of the learning algorithm. The result for Fish
was good, but some errors still remained for Airplane even after 2000
parameter updates. The three views seemed to be insufficient for reconstructing
airplanes.

Fig. 8. Representation of a perspective projection of a polyhedron by a neural network.

Fig. 10. Backward signal route of the back propagation algorithm.

Fig. 11. Exp. 1 Shape modeling experiment for a fish shape. A polyhedron with 62
vertices was used. (Panels: NN’s output for the novel view and its target; NN’s output
for training views 1 to 5 and their targets, at the initial shape, the 50th epoch, and the 100th epoch.)

Fig. 12. Exp. 2 Shape modeling experiment for an airplane shape. A polyhedron with
114 vertices was used.



5. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J., 1992. Training Models of
Shape from Sets of Examples, Proc. British Machine Vision Conference, Springer-Verlag, 9-18.

6. Hartley, R.I., 1997. In defense of the eight-point algorithm, IEEE Trans. Pattern
Analysis and Machine Intelligence 19 (6), 580-593.

7. Igarashi, T., Matsuoka, S., Tanaka, H., 1999. Teddy: A Sketching Interface for 3D
Freeform Design, ACM SIGGRAPH’99, Los Angeles, 409-416.

8. Jain, A.K., Zhong, Y., Lakshmanan, S., 1996. Object Matching Using Deformable
Templates, IEEE Trans. Pattern Analysis and Machine Intelligence 18 (3), 267-278.

9. Kumazawa, I., 2000. Compact and parametric shape representation by a tree of
sigmoid functions for automatic shape modeling, Pattern Recognition Letters 21,
651-660.

10. Shum, H.Y., Hebert, M., Ikeuchi, K., Reddy, R., 1997. An Integral Approach to
Free-Form Object Modeling, IEEE Trans. Pattern Analysis and Machine Intelligence
19 (12), 1366-1375.

11. Zhang, Z., 1998. Image-based Geometrically-Correct Photorealistic Scene/Object
Modeling (IBPhM): A Review, Proc. 3rd Asian Conference on Computer Vision
(ACCV’98), 340-349.

12. Zhang, Z., 1998. Determining the epipolar geometry and its uncertainty: A review,
International Journal of Computer Vision.











Robust Structural Indexing through Quasi-Invariant 
Shape Signatures and Feature Generation 



Hirobumi Nishida 

Ricoh Software Research Center, Koishikawa, Bunkyo-ku, 112-0002, Japan



Abstract. A robust method is presented for retrieval of model shapes that have 
parts similar to the query shape presented to the image database. Structural 
feature indexing is a potential approach to efficient shape retrieval from large 
databases, but it is sensitive to noise, scales of observation, and local shape 
deformations. To improve the robustness, shape feature generation techniques 
are incorporated into structural indexing based on quasi-invariant shape 
signatures. The feature transformation rules obtained by an analysis of some 
particular types of shape deformations are exploited to generate features that 
can be extracted from deformed patterns. Effectiveness is confirmed through 
experimental trials with databases of boundary contours, and is validated by 
systematically designed experiments with a large number of synthetic data. 



1 Introduction 

Efficient and robust retrieval from large image databases by shape [1] is an important 
problem. In particular, shapes observed in natural scenes are often occluded, 
corrupted by noise, and partially visible. It is challenging to develop a robust method 
for efficient retrieval of model shapes that have parts similar to the query shape 
presented to the image database. 

Shape retrieval from image databases has been studied recently for improving
robustness against noise and shape deformations. For structural organization [2,3] of
databases composed of boundary contours of objects, Del Bimbo [4] and Mokhtarian
et al. [5] apply the curvature scale-space approach to feature indexing, and Sclaroff
[6] proposes a method for image indexing with modal matching. However, these
methods assume that the query shape is presented as a closed contour. Structural
feature indexing is efficient, but is sensitive to noise, scales of observation, and local
shape deformations. The correct model does not necessarily receive as many votes as
expected for the ideal case, and the performance is degraded drastically. Stein et al.
[7] cope with this problem by extracting features from several versions of polygonal
approximations of boundary contours. This method can also be applied when the
query shape is presented as part of a boundary contour, but the efficiency is degraded.

Efficiency and robustness are important, but sometimes incompatible, criteria for
performance evaluation. The improvement of robustness implies that the scheme for
classification and retrieval should tolerate certain types of variations and deformations
of images. Obviously, this may lead to inefficiency if brute-force methods are
employed, such as a generate-and-test strategy that generates various images with a
number of different parameters.

One idea for achieving both efficiency and robustness is to handle deformations in
the feature domain by generating features that can be extracted from deformed
patterns [8]. Feature generation models can be obtained through an analysis of feature
transformations due to some particular types of shape deformation. Robustness can be
improved by incorporating generated features in the indexing process. This approach
has been applied successfully when queries are presented as closed contours [8]. It is
attractive to extend and generalize this approach to the problem being considered.

In this paper, based on the structural indexing with feature generation models, an 
efficient, robust method is presented for retrieval of model shapes that have parts 
similar to the query shape presented to the image database. This paper is organized as 
follows: In Section 2, a structural representation of curves by quasi-convex/concave 
features along with quantized-directional features [8] is outlined. In Section 3, based 
on the shape representation outlined in Section 2, we describe the quasi-invariant 
shape signature, the model database organization through feature indexing, and the 
shape retrieval through voting. In Section 4, the transformation rules of shape 
signatures are introduced to generate features that can be extracted from deformed 
patterns caused by noise and local shape deformations. The proposed algorithm is 
summarized in Section 5. In Section 6, the proposed method is validated by 
systematically designed experiments with a large number of synthetic data. Section 7 
concludes this paper. 



2 Shape Representation 



The structural representation of curves [8] is outlined in this section, based on
quasi-convex/concave structures incorporating 2N quantized-directional features (N is a
natural number). As shown in Fig. 1a, the curve is first approximated by a series of
line segments. On a 2-D plane, we introduce N axes together with 2N quantized-direction
codes. For instance, when N = 4, eight quantized directions are defined
along with the four axes as shown in Fig. 1b. Based on these N axes together with
2N quantized-direction codes, the analysis is carried out hierarchically.

Fig. 1. (a) A closed contour with a polygonal approximation, (b) quantized-directional codes
when N = 4, (c) sub-segments when N = 4, (d) segments when N = 4.







A curve is decomposed into sub-segments at extremal points along each of the N
axes. Fig. 1c illustrates the decomposition of the contour shown in Fig. 1a into
sub-segments when N = 4. For adjacent sub-segments $a$ and $b$, suppose that we turn
counterclockwise when traversing them from $a$ to $b$, and the joint of $a$ and $b$ is an
extremal point along the axes toward the directions $(j, j+1 \pmod{2N}, \ldots, k)$. Then, we
write the concatenation of these two sub-segments as $a \xrightarrow{j,k} b$. For instance, the
joint of sub-segments H and G in Fig. 1c is an extremal point along the three axes
toward the directions 3, 4, and 5. Therefore, the concatenation of H and G is written
as $H \xrightarrow{3,5} G$. In this way, we obtain the following concatenations for the
sub-segments illustrated in Fig. 1c.

M, l, k, i-H- j, 

/, G, F-H- G, ^ E, 

D—^ E, C-H- D, c, b, m 

By linking local features around joints of adjacent sub-segments, some sequences
of the following form can be constructed:

$$a_0 \xrightarrow{j(1,0),\, j(1,1)} a_1 \xrightarrow{j(2,0),\, j(2,1)} \cdots \xrightarrow{j(n,0),\, j(n,1)} a_n \qquad (1)$$

A part of the curve corresponding to a sequence of this form is called a segment.
When a segment is traversed from $a_0$ to $a_n$, one turns counterclockwise around any
joints of sub-segments. The following segments, as shown in Fig. 1d, are generated
from the 13 sub-segments:

5i:A^4- M, S2-.A^4- B^4- C— ^ 0^4- E, 
S2-.E^4 e, G, g, 

S(,\h-4- j—4 l-4- m. 

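A segment is characterized by a pair of integers $\langle r, d \rangle$, the characteristic numbers,
representing the angular span of the segment and the direction of the first sub-segment:

$$r = \sum_{i=1}^{n} \bigl(j(i,1) - j(i,0)\bigr) \bmod 2N + \sum_{i=1}^{n-1} \bigl(j(i+1,0) - j(i,1)\bigr) \bmod 2N + 2, \qquad d = j(1,0)$$

The characteristic numbers are given by $\langle 2,7 \rangle$, $\langle 7,3 \rangle$, $\langle 2,4 \rangle$, $\langle 3,0 \rangle$, $\langle 4,3 \rangle$, and
$\langle 6,7 \rangle$, respectively, for the six segments shown in Fig. 1d.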

Adjacent segments are connected by sharing the first sub-segments or the last ones of
the corresponding sequences. These two types of connection are denoted by $S \overset{h}{-} T$
and $S \overset{t}{-} T$, respectively, for two adjacent segments $S$ and $T$. For instance, connections
are denoted by — ^2 - 53 — ^4 - for the six segments shown in Fig. Id. 
3 Quasi-Invariant Shape Signature, Indexing, and Voting

Based on the shape representation outlined in Section 2, we describe the quasi- 
invariant shape signature, the model database organization through feature indexing, 
and the shape retrieval through voting. For the model database organization, we 
assume that each model shape is presented to the system as line drawings or boundary 
contours of objects. In the shape retrieval, we assume that line drawings or parts of 
some model shape can be given as a query to the system. 



3.1 Quasi-Invariant Shape Signatures 

In order to retrieve images for a query given as a partially visible shape, the shape
signature is required to tolerate rotation, scaling, and translation. Therefore, features
depending on orientation, size, and location cannot be employed as shape signatures.
Based on the characteristic numbers and connections of segments extracted from
model shapes or query shapes, the shape signature is constructed to satisfy this
requirement. We assume that a series of $n$ segments $S_i$ $(i = 1, \ldots, n)$ have been
extracted with characteristic numbers $\langle r_i, d_i \rangle$ and their lengths. The angular span $r_i$
does not depend on orientation, size, or location. Furthermore, the lack of information
due to dropping orientation, size, and location can be compensated for by employing a
triplet of the angular spans of two consecutive segments and their length ratio as the
shape signature. From two consecutive segments $S_i$ and $S_{i+1}$ connected as
$S_i \overset{c}{-} S_{i+1}$, $c \in \{h, t\}$, the quasi-invariant shape signature is constructed as follows:

$$\ldots$$

where $Q$ is the number of quantization levels for length-ratio parameters.

Fig. 2. Model database organization by structural indexing. Each table item stores a list
whose element is composed of the model identifier, length, location of the center of gravity,
and locations of the two end points of the curve segment.



3.2 Indexing 

From each model shape, shape signatures are extracted from all pairs of consecutive 
segments. A large table, as illustrated in Fig. 2, is constructed for a model set by 
assigning a table address to a shape signature and by storing there a list whose item is 
composed of the following elements: the model identifier that has the corresponding 
shape signature, and shape parameters of the curve segment corresponding to the 
shape signature, namely length, location of the center of gravity, and locations of the 
two endpoints, computed on the model shape. 



3.3 Voting 

Classification of the query shape is carried out by voting for the transformation space
associated with each model. For each model, voting boxes are prepared for the
quantized transformation space $(\sigma, \theta, x_j, y_j)$, where $\sigma$ is the scaling factor, $\theta$ is the
rotation angle, and $(x_j, y_j)$ is the translation vector. Shape signatures are extracted
from the curve segment given as a query to the shape database. For each extracted
shape signature, model identifiers and shape parameters are retrieved from the table
by computing the table address. By comparing the shape parameters of the extracted
shape signature with the registered parameters, the transformation parameters
$(\sigma, \theta, x_j, y_j)$ can be computed for each model, and the voting box corresponding to
the transformation parameters associated with the model is incremented by one. In the
implementation, the transformation parameters $\sigma$ and $\theta$ are computed from the line
segment connecting the two endpoints, and $(x_j, y_j)$ is computed from the location of
the center of gravity.
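
One way to picture the voting step is the sketch below. The text only states that the scale and rotation come from the endpoint segment and the translation from the center of gravity; the exact formulas, the bin widths in `quantize`, and all function names here are assumptions chosen for illustration.

```python
import math
from collections import Counter

def similarity_from_segment(model_ends, model_cg, query_ends, query_cg):
    """Estimate (scale, rotation, tx, ty) mapping a model segment onto a query segment."""
    (mx1, my1), (mx2, my2) = model_ends
    (qx1, qy1), (qx2, qy2) = query_ends
    mdx, mdy = mx2 - mx1, my2 - my1
    qdx, qdy = qx2 - qx1, qy2 - qy1
    sigma = math.hypot(qdx, qdy) / math.hypot(mdx, mdy)       # scale from chord lengths
    theta = math.atan2(qdy, qdx) - math.atan2(mdy, mdx)       # rotation from chord directions
    theta = math.atan2(math.sin(theta), math.cos(theta))      # wrap into (-pi, pi]
    c, s = math.cos(theta), math.sin(theta)
    # Translation chosen so that the transformed model centroid lands on the query centroid.
    tx = query_cg[0] - sigma * (c * model_cg[0] - s * model_cg[1])
    ty = query_cg[1] - sigma * (s * model_cg[0] + c * model_cg[1])
    return sigma, theta, tx, ty

def quantize(params, steps=(0.25, math.pi / 8, 5.0, 5.0)):
    """Map continuous parameters to a voting box (bin widths are arbitrary here)."""
    return tuple(math.floor(p / w + 0.5) for p, w in zip(params, steps))

votes = Counter()
# Toy case: the query segment is the model segment scaled by 2, rotated 90 degrees, shifted by (3, 4).
params = similarity_from_segment(((0, 0), (1, 0)), (0.5, 0.0), ((3, 4), (3, 6)), (3.0, 5.0))
votes[(0, quantize(params))] += 1   # key: (model identifier, voting box)
```

Each retrieved table entry contributes one such vote, so a model that truly matches accumulates many votes in a single box.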



4 Feature Generation Models 

Shape signatures extracted from the curve are sensitive to noise and local shape 
deformations, and therefore, the correct model does not necessarily receive as many 
votes as expected for the ideal case. Furthermore, when only one sample pattern is 
available for each class, techniques of statistical or inductive learning from training 
data cannot be employed for obtaining a priori knowledge and feature distributions of 
deformed patterns. To cope with these problems, we analyze the feature 
transformations caused by some particular types of shape deformations, constructing 
feature transformation rules. Based on the rules, we generate segment features that 
can be extracted from deformed patterns caused by noise and local shape 
deformations. In both model database organization and classification, the
features generated by the transformation rules are used for structural indexing and
voting, together with the features actually extracted from the curves.

The following two types of feature transformations are considered in this work: 

- Change of convex/concave structures caused by perturbations along normal 
directions on the curve and scales of observation, along with transformations of 
characteristic numbers (the angular span of the segment and the direction of the 
first sub-segment). 

- Transformations of characteristic numbers caused by small rotations. 

We describe these two types of transformation in the rest of this section. 



4.1 Transformations of Convex/Concave Structures 

The convex/concave structures along the curve are changed by noise and local
deformations, and also depend on the scale of observation. For instance, the two parts of
curves shown in Fig. 3a are similar to one another at a global scale, but their
structural features are different. When $N = 4$, the curve shown on the left is composed of
three segments connected as $S_1 - S_2 - S_3$ with characteristic numbers $\langle 6,6\rangle$, $\langle 2,6\rangle$,
and $\langle 3,2\rangle$, whereas the one shown on the right is composed of five segments connected
as $S'_1 - S'_2 - S'_3 - S'_4 - S'_5$ with characteristic numbers $\langle 6,6\rangle$, $\langle 2,6\rangle$, $\langle 2,2\rangle$, $\langle 2,6\rangle$, and
$\langle 3,2\rangle$. To cope with such deformations, the structural features on the two curves are
edited so that they become similar to one another. For instance, the
structural features illustrated in Fig. 3a can be edited by merging the two segment
blocks $\{S_1, S_2, S_3\}$ and $\{S'_1, S'_2, S'_3, S'_4, S'_5\}$ into two segments $S$ and $S'$, as shown in Fig.
3b. In the structural indexing and voting processes, for an integer $M$ specifying the
maximum number of segments to be merged, shape signatures are generated by
applying Rule 1 described below to segment blocks.

Fig. 3. (a) Parts of curves similar to one another at a global scale, (b) editing structural
features by merging segment blocks, (c) transformations of characteristic numbers of segments
by small rotations.

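Rule 1: Let $\langle r_j, d_j\rangle$ be the characteristic number of the segment $S_j$, and let
$l(S_j S_{j+1} \cdots S_{j+k-1})$ be the length of the curve composed of the $k$ consecutive segments
$S_j S_{j+1} \cdots S_{j+k-1}$. From a segment block

$$(S_j, m, n) = \{\, S_{j+k} \mid k = 0, 1, \ldots, m, \ldots, m+n-1;\ S_j - S_{j+1} - \cdots - S_{j+m-1} - \cdots - S_{j+m+n-1} \,\},$$

where $m$ and $n$ are odd such that $1 \le m, n \le M$, a shape signature

$$\left( \sum_{k=0}^{m-1} (-1)^k r_{j+k},\ \ \sum_{k=0}^{n-1} (-1)^k r_{j+m+k},\ \ c_j,\ \ \left[ Q \cdot \frac{l(S_j S_{j+1} \cdots S_{j+m-1})}{l(S_j S_{j+1} \cdots S_{j+m-1} S_{j+m} \cdots S_{j+m+n-1})} \right] \right),$$

whose last element is the length ratio quantized to $Q$ levels, is generated if

$$\sum_{k=0}^{m-1} (-1)^k r_{j+k} \ge 2, \qquad \sum_{k=0}^{n-1} (-1)^k r_{j+m+k} \ge 2, \qquad r_{j+2k-2} - r_{j+2k-1} + r_{j+2k} \ge 2$$

for $k = 1, \ldots, (m-1)/2$, and $r_{j+m+2k-2} - r_{j+m+2k-1} + r_{j+m+2k} \ge 2$ for
$k = 1, \ldots, (n-1)/2$.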

For instance, when $N = 4$ and $M = 3$, from the six segments illustrated in Fig. 1d
with characteristic numbers $\langle 2,7\rangle$, $\langle 7,3\rangle$, $\langle 2,4\rangle$, $\langle 3,0\rangle$, $\langle 4,3\rangle$, and $\langle 6,7\rangle$, the
following shape signatures (length-ratio omitted) are generated by Rule 1: (2, 7, h), (2,
8, h), (7, 2, t), (7, 3, t), (8, 4, t), (2, 3, h), (2, 5, h), (3, 6, h), (3, 11, h), (3, 4, t), (5, 2, t),
(4, 6, h), (4, 11, h), (6, 2, t), (11, 2, t), (11, 3, t). In total, at most $n \lceil M/2 \rceil^2$ shape
signatures are generated from $n$ segments.
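
The worked example can be reproduced with a short sketch of Rule 1. It assumes that the six segments form a closed contour (so that segment indices wrap around), that the concatenation types alternate h, t, h, t, ..., and that the merging conditions use "greater than or equal to 2"; these assumptions, together with the function names, are introduced here for illustration, and the length-ratio element is omitted as in the text.

```python
from itertools import product

def alt_sum(spans):
    """Alternating sum r_0 - r_1 + r_2 - ... over a block of angular spans."""
    return sum(-r if k % 2 else r for k, r in enumerate(spans))

def mergeable(spans):
    """Rule 1 conditions for merging an odd-length block into a single segment."""
    triples_ok = all(spans[k - 2] - spans[k - 1] + spans[k] >= 2
                     for k in range(2, len(spans), 2))
    return alt_sum(spans) >= 2 and triples_ok

def rule1_signatures(r, c, M):
    """r[j]: angular span of segment j; c[j]: concatenation type between j and j+1."""
    n = len(r)
    odd = range(1, M + 1, 2)
    out = []
    for j in range(n):
        for m, n2 in product(odd, odd):
            if m + n2 > n:          # a block cannot wrap onto itself
                continue
            first = [r[(j + k) % n] for k in range(m)]
            second = [r[(j + m + k) % n] for k in range(n2)]
            if mergeable(first) and mergeable(second):
                out.append((alt_sum(first), alt_sum(second), c[j]))
    return out

spans = [2, 7, 2, 3, 4, 6]                      # r-values of the six segments
concat = ["h", "t", "h", "t", "h", "t"]         # assumed alternating concatenation types
signatures = rule1_signatures(spans, concat, M=3)
```

Under these assumptions the sketch yields the same 16 signatures listed in the text, which stays below the bound of 6 x 2^2 = 24.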



4.2 Transformations of Characteristic Numbers by Small Rotations 

The characteristic number $\langle r, d\rangle$ $(r \ge 2)$ can be transformed by rotating the shape.
Rules can be introduced for generating characteristic numbers by rotating the shape
slightly (see Fig. 3c).

Rule 2: When the curve composed of the two consecutive segments $S_1$ and $S_2$
with characteristic numbers $\langle r_1, d_1\rangle$ and $\langle r_2, d_2\rangle$ is rotated by an angle
$\psi$ $(-\pi/N < \psi < \pi/N)$, the angular spans $r_1$ and $r_2$ can be transformed into one of
the 9 cases: (1) $(r_1, r_2)$, (2) $(r_1, r_2 - 1)$, (3) $(r_1, r_2 + 1)$, (4) $(r_1 - 1, r_2)$, (5)
$(r_1 - 1, r_2 - 1)$, (6) $(r_1 - 1, r_2 + 1)$, (7) $(r_1 + 1, r_2)$, (8) $(r_1 + 1, r_2 - 1)$, (9) $(r_1 + 1, r_2 + 1)$.

Note that cases (4)-(6) are applicable only if $r_1 \ge 3$, and that cases (2), (5), and
(8) are applicable only if $r_2 \ge 3$.

For instance, when $N = 4$ and $M = 3$, 16 shape signatures are
generated by Rule 1 from the six segments illustrated in Fig. 1d. Then, by applying
Rule 2 to these generated signatures, 120 shape signatures are generated in total.
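
The count of 120 can be checked with a small sketch of Rule 2. It assumes that the applicability conditions are "r >= 3", that the unchanged case (1) is counted, and that duplicates produced from different source signatures are not removed; the function name and the tuple layout are illustrative only.

```python
def rule2_variants(sig):
    """Return the Rule 2 cases for one (r1, r2, concatenation) signature, in case order."""
    r1, r2, c = sig
    d1 = [0, -1, 1] if r1 >= 3 else [0, 1]   # cases (4)-(6) need r1 >= 3
    d2 = [0, -1, 1] if r2 >= 3 else [0, 1]   # cases (2), (5), (8) need r2 >= 3
    return [(r1 + a, r2 + b, c) for a in d1 for b in d2]

# The 16 signatures generated by Rule 1 in the example (length ratio omitted).
rule1_output = [(2, 7, "h"), (2, 8, "h"), (7, 2, "t"), (7, 3, "t"), (8, 4, "t"),
                (2, 3, "h"), (2, 5, "h"), (3, 6, "h"), (3, 11, "h"), (3, 4, "t"),
                (5, 2, "t"), (4, 6, "h"), (4, 11, "h"), (6, 2, "t"), (11, 2, "t"),
                (11, 3, "t")]
total = sum(len(rule2_variants(s)) for s in rule1_output)
```

Eight of the signatures contain a span equal to 2 and contribute 6 variants each, while the other eight contribute 9 each, giving 48 + 72 = 120.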
5 Algorithm

In the model database organization step by structural indexing, shape signatures are
generated from each model shape by Rules 1 and 2, and the model identifier with the
shape parameters is appended to the list stored at the table address corresponding to
each generated shape signature. For each model $i$ $(i = 1, 2, \ldots, n)$, let $c_i$ be the
number of shape signatures generated by Rules 1 and 2. For instance, $c_i = 120$ for the
contour shown in Fig. 1a when $N = 4$ and $M = 3$. In the classification and retrieval
by voting for models and the transformation space, shape signatures are extracted
from the query shape, and further shape signatures are generated from them by Rules 1 and 2. Model
identifier lists are retrieved from the tables by using the addresses computed from the
generated shape signatures, and the transformation parameters are computed for each
model on the lists. The voting box is incremented by one for the model and the
computed transformation parameters. Let $v_i$ $(i = 1, 2, \ldots, n)$ be the maximum number of votes
among the voting boxes associated with the model $i$. The query shape is classified by
selecting models in descending order of $v_i / c_i$. Examples of
shape retrieval are given in Fig. 4, where query shapes are presented at the top along with
the retrieved model shapes.
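
The final selection step can be sketched as a normalized ranking. The function name `rank_models` and the toy numbers are hypothetical; only the score $v_i / c_i$ is taken from the text.

```python
def rank_models(max_votes, signature_counts, top=3):
    """Rank model identifiers by v_i / c_i in descending order."""
    scores = {i: max_votes.get(i, 0) / signature_counts[i] for i in signature_counts}
    return sorted(scores, key=lambda i: scores[i], reverse=True)[:top]

# Toy values: model 1 has the most raw votes but also the most generated signatures.
ranking = rank_models({0: 30, 1: 45, 2: 10}, {0: 120, 1: 300, 2: 20})
```

Dividing by $c_i$ keeps models with many generated signatures from dominating simply because they occupy more table entries.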



6 Experiments 

In this section, the proposed algorithm is evaluated quantitatively in terms of
robustness against noise and shape deformations, based on systematically
designed, controlled experiments with a large number of synthetic data [8]. We
examined the probability that the correct model is included in the top $t\%$ of choices for
various values of the deformation parameter $\beta$ [8] when curves composed of $r\%$
portions of a model shape are given as queries. For given values of $r$ and $\beta$, a
sub-contour of $r\%$ of the length is randomly extracted from the model shape, and then it is
deformed by the deformation process described in Nishida [8].

The main contribution of this work is to incorporate the shape feature generation 
into the structural indexing for coping with shape deformations and feature 
transformations. Therefore, the performance was compared with a naive method 
extracting features from several versions of piecewise linear approximations of the 
curve with a variety of error tolerances for approximations. 

We carried out several experimental trials by changing the number of models from 
200 to 500, examining the classification accuracy in terms of the deformed portions of 
model shapes given as queries to image databases. In the implementation of the naive 
method, by changing the error tolerance with Ramer’s method from 1% to 20%, with 
a step of 1%, of the widest side of the bounding box of the curve, twenty versions of 
approximations were created for each model shape and the query shape. 
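
For reference, the naive baseline can be pictured with the following sketch of Ramer's polyline approximation (also known as the Ramer-Douglas-Peucker algorithm) run at twenty tolerances. The function names and the toy curve are assumptions; the paper does not give its implementation.

```python
import math

def ramer(points, tol):
    """Piecewise linear approximation: keep points farther than tol from the chord."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    chord = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        if chord == 0:
            return math.hypot(p[0] - x1, p[1] - y1)
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / chord

    k, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                  key=lambda t: t[1])
    if dmax <= tol:
        return [points[0], points[-1]]
    # Split at the farthest point and approximate both halves recursively.
    return ramer(points[:k + 1], tol)[:-1] + ramer(points[k:], tol)

def tolerances(points, percents=range(1, 21)):
    """Error tolerances as 1%..20% of the widest side of the bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    side = max(max(xs) - min(xs), max(ys) - min(ys))
    return [side * p / 100 for p in percents]

curve = [(0, 0), (5, 0.05), (10, 0), (10, 10)]
versions = [ramer(curve, t) for t in tolerances(curve)]
```

With this toy curve, every one of the twenty tolerances is coarse enough to drop the nearly collinear point (5, 0.05), while a much tighter tolerance would keep it.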

Table 1 presents the average classification rates for the top 1%, 2%, 3%, 5%, and 10%
choices when $\beta \in [0.0, 0.5]$, $\beta \in [0.5, 1.0]$, and $\beta \in [1.0, 1.5]$. For instance, when a
curve segment composed of an 80% portion of a model shape subjected to the
deformation process with $\beta \in [0.5, 1.0]$ is given as a query to shape databases of 500
models, the correct models are included in the top 15 choices (3%) with probability
98.2% for the proposed algorithm and with probability 83.9% for the naive method. The
computation time of the proposed method is comparable to that of the naive method.
Clearly, the proposed algorithm significantly improves robustness against noise and
local shape deformations in terms of classification accuracy, without degrading
efficiency. The experiments have thus verified the effectiveness of the shape
signature and the shape feature generation models.



7 Conclusion 

Structural feature indexing is a potential approach to efficient shape retrieval from 
large image databases, but the indexing is sensitive to noise, scales of observation, 
and local shape deformations. It has now been confirmed that efficiency of 
classification and robustness against noise and local shape transformations can be 
improved at the same time by the feature indexing approach incorporating shape 
feature generation techniques [8]. In this paper, based on this approach, an efficient, 
robust method has been presented for retrieval of model shapes that have parts similar 
to the query shape presented to the image database. The effectiveness has been 
confirmed by experimental trials with a large database of boundary contours and has 
been validated by systematically designed experiments with a large number of 
synthetic data. 
Fig. 4. Examples of shape retrieval.

Table 1. Average classification rates (%) of deformed patterns by the proposed algorithm in
terms of the portion of model shapes (r%) presented as queries, in comparison with the naive
method.
Abstract. This paper proposes an efficient algorithm that not only narrows
down the search domain of face identification but also reconstructs various
3D objects from a single free-hand line drawing. The algorithm is executed in
two stages. In the face identification stage, we generate potential faces and
classify them into implausible, basis and minimal faces by using geometrical and
topological constraints to reduce the search space. The proposed algorithm searches the
space of minimal faces only, so that the actual faces of an object are identified quickly. In the
object reconstruction stage, we introduce 3D regularities and quadric face
regularities to reconstruct the 3D object accurately. Furthermore, the proposed method
can be applied to a wide range of general objects containing flat and quadric
faces. The experimental results show that the proposed method identifies faces
much faster than previous ones and efficiently reconstructs various objects from
a single free-hand line drawing.



1 Introduction 

During the conceptual design stage of mechanical parts, designers tend to draw their
basic ideas mainly on paper with a pencil. A line drawing is an easy way to input the
geometrical information of a 3D object. Once the 3D model is obtained, it can be
manipulated and modified, and further detail can be sketched in to obtain a more
detailed and accurate object. This approach provides designers with the means to
convey their ideas to a CAD system. Therefore, it is necessary to develop an algorithm
for automatically reconstructing 3D objects from a free-hand line drawing.

Much work has been done on the reconstruction of 3D objects from a line drawing.
Marti et al. [1] expanded the junction library by using line-labeling techniques. However,
they relied on the line font (dashed/solid) to extract spatial information. Marill [2]
suggested an optimization-based reconstruction of the depth information of vertices using
MSDA. Leclerc et al. [3] identified all non-self-intersecting closed circuits of edges.
However, their method does not apply to concave faces or to sketches with
ambiguities. In addition, they amended Marill's method [2] using face planarity;
however, both methods are limited in the object types they can handle. Shpitalni et al. [4]
identified faces efficiently by using the maximum rank equation and face adjacency theory.
However, their method requires a large search domain of faces that includes numerous
implausible faces. Lipson et al. [5] reconstructed 3D objects containing flat and
cylindrical faces with an optimization method that formalizes various image regularities.
However, the reconstruction results tend to be somewhat distorted, and they are limited
to objects containing flat and cylindrical faces.

Although many methods have been proposed, it is difficult to develop a practical
reconstruction system for the following three reasons: (1) Because a 2D line drawing
corresponds to multiple 3D objects and contains a tremendous number of potential faces,
previous methods require large combinatorial searches for face identification. (2) The
reconstruction results tend to be somewhat distorted due to the inherent
inaccuracies in the line drawing. (3) In addition, the reconstruction error of a curved
object is significantly larger because most 2D image regularities are derived
from planar configurations of 2D entities.

In this paper, we describe a novel algorithm for quickly identifying the 2D actual faces of an
object and efficiently reconstructing 3D objects from a single free-hand line
drawing. Conventional methods classify potential faces into implausible and minimal
faces and identify the actual faces through a combinatorial search of the minimal faces. This
paper proposes a new method for minimizing the number of minimal faces by
efficiently classifying potential faces into implausible, basis and minimal faces. By
introducing constraints that consider the relation between the line drawing and the object, we
recognize basis faces that can be determined to be actual faces without searching,
implausible faces that cannot be actual faces, and undetermined minimal faces. The 2D actual faces
of an object are identified quickly by searching the reduced set of minimal faces only. Furthermore,
the proposed algorithm reconstructs various objects containing flat and quadric faces
by introducing constraints of 3D regularities and quadric face regularities.



2 Overview of the Reconstruction Process 

In this paper, the input is a single free-hand line drawing [4, 5]. The 2D sketch represents a
general object in wireframe. The projection reveals all edges and vertices uniquely. In
addition, all drawn lines represent real edges, silhouette curves or intersections of
faces in the 3D object. The algorithm supports general (manifold and non-manifold)
objects containing flat and quadric faces.

Reconstruction of 3D objects from a 2D line drawing consists of two stages: face
identification and object reconstruction. In the face identification stage, the algorithm
first analyzes the line drawing to obtain an edge-vertex graph [6], and then restores the
topological information of the object using topological and geometrical constraints of the
edge-vertex graph. In the object reconstruction stage, the algorithm reconstructs the
geometrical information of the object by using various constraints of regularities. The
system concept is illustrated in Fig. 1.







B.-S. Oh and C.-H. Kim





Fig. 1. System overview 



3 Identification of Faces 

In this section, we will discuss the method for reducing search space of face identifi- 
cation by classifying potential faces based on geometrical and topological constraints. 



3.1 Face Classification 

Because a line drawing contains numerous potential faces that may correspond to faces of the
depicted object, it is necessary to reduce the search space of face
identification. We describe several constraints that cut down the search space by classifying
potential faces into implausible, minimal and basis faces as in Table 1. We refer to the sets of
potential, implausible, minimal and basis faces as PF, IF, MF, and BF, respectively.



Table 1. Classification of potential faces

Face class    Description                            Symbol
Potential     All candidate faces                    PF
Implausible   Implausible faces for all objects      IF
Minimal       Candidate actual faces of an object    MF
Basis         Actual faces for all objects           BF

If we denote the set of actual faces of an object by AF, we can derive Equations (1)-(3),
which mean that we can identify the actual faces by searching the minimal faces only.

$$IF \cup MF \cup BF = PF. \tag{1}$$

$$IF \cap MF = MF \cap BF = BF \cap IF = \emptyset. \tag{2}$$

$$BF \subset AF, \quad AF \subset (BF \cup MF). \tag{3}$$
3.2 Face Classification Step

We define the ranks $R(v)$ and $R(e)$ as the number of faces whose boundary contains the
vertex $v$ or the edge $e$, respectively, and denote the upper bounds of the ranks by $R^+(v)$ and $R^+(e)$ [3]. In
addition, we define $RF(v)$ and $RF(e)$ as the sets of faces whose boundary contains that
entity.

There are 6 steps for the classification of potential faces.

Step 1. Generate all potential faces, PF, using the $n$ edges [4]. Initially,
$IF = MF = BF = \emptyset$.

$$PF = \mathit{make\_potential\_face}\{e_1, \ldots, e_n\}. \tag{4}$$

Step 2. Find implausible faces, IF, that contain internal edge(s).

$$\{ f \mid f, f_1, f_2 \in PF,\ f = (f_1 \cup f_2) - (f_1 \cap f_2),\ \text{if } \forall e \in (f_1 \cap f_2),\ e \text{ is an internal edge of } f \}. \tag{5}$$

Step 3. Find basis faces, BF.

$$\{ f \mid f \in (PF - IF) = F,\ [\text{connected edges } e_1, e_2,\ n[RF(e_1) \cap RF(e_2)] = 1,\ f \in RF(e_1) \cap RF(e_2)] \}. \tag{6}$$

Step 4. Find implausible faces by using the maximum rank.

$$\{ f \mid f \in (PF - BF - IF),\ [\exists e,\ RBF(e) = R^+(e),\ f \in (RF(e) - RBF(e))] \}. \tag{7}$$

Step 5. Recover over-reduced minimal faces.

$$\{ f \mid f \in IF,\ F = (PF - IF),\ f \in \mathit{makeface}\{ e \mid (R^+(e) - n(RF(e))) > 1 \} \}. \tag{8}$$

Step 6. Repeat Steps 3-5 until the face classes no longer change. All faces in
$(PF - IF - BF)$ are undetermined minimal faces.

For example, in Step 1, we can generate 15 potential faces from the 2D sketch shown
in Fig. 2.

Fig. 2. All 15 potential faces that could be used in 3D reconstruction from the 2D sketch
In Step 2, we can find 7 implausible faces. Applying Equation (6),
we can find 6 basis faces. However, according to the face
adjacency theorem [3], two of these faces cannot coexist. Therefore, a constraint must be
added to Step 3.

$$\{ f \mid f_1, f_2 \in BF,\ \forall e \in (f_1 \cap f_2) \text{ are smooth [4]} \}. \tag{9}$$

By applying Equation (9), two of these faces remain among the potential faces. In Step 4,
we find two implausible faces, as shown in Fig. 3. Steps 5 and 6 do not affect this
example. Finally, we extract 4 basis faces and two minimal faces.
By searching the minimal faces only, we can identify the actual faces of an object quickly.




Fig. 3. Implausible faces complying with Equation (7) 

Identifying the faces of an object in a sketch can be formulated as a selection problem,
i.e., selecting $k$ faces among the $m$ potential faces such that the $k$ faces represent a
valid object, which requires a combinatorial search over $2^m$ subsets.

We can identify the actual faces by minimizing Equation (10) [4]. By minimizing
the number of minimal faces in the combinatorial search, we identify the actual faces quickly.

$$|R^+(e) - R(e)| + |R^+(v) - R(v)|. \tag{10}$$
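
The effect of the classification on the search can be illustrated with a small sketch. It assumes that every candidate set of actual faces consists of all basis faces plus some subset of the minimal faces, and it takes the cost function as a caller-supplied placeholder rather than an implementation of Equation (10); the face labels and function names are hypothetical.

```python
from itertools import combinations

def candidate_face_sets(basis, minimal):
    """Yield every candidate: all basis faces plus one subset of the minimal faces."""
    minimal = sorted(minimal)
    for k in range(len(minimal) + 1):
        for chosen in combinations(minimal, k):
            yield set(basis) | set(chosen)

def identify_faces(basis, minimal, cost):
    """Return the candidate with the lowest cost (cost is supplied by the caller)."""
    return min(candidate_face_sets(basis, minimal), key=cost)

# Counts from the running example: 15 potential faces, 4 basis faces, 2 minimal faces.
basis_faces = {1, 2, 3, 4}          # hypothetical labels
minimal_faces = {5, 6}              # hypothetical labels
unreduced = 2 ** 15                 # subsets of all potential faces
reduced = len(list(candidate_face_sets(basis_faces, minimal_faces)))

# Toy cost that prefers five faces; a real system would evaluate the rank deficits.
best = identify_faces(basis_faces, minimal_faces, cost=lambda faces: abs(len(faces) - 5))
```

In this example the search shrinks from 32768 subsets to 4 candidates, which is the reduction the classification steps aim for.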



4 Reconstruction of 3D Objects 

In this section, we introduce constraints of 3D regularities and quadric regularities to
minimize the distortion of the reconstructed 3D object resulting from an inaccurate 2D sketch.



4.1 Basic Reconstruction Algorithm 

To reconstruct the geometrical information of a 3D object, Lipson et al. [5] proposed 13
image regularities. A 3D configuration can be evaluated by a compliance function obtained by
summing the contributions of the regularity terms. The final compliance function to be
optimized takes the form

$$\mathbf{w}^{T}\left[\alpha_{\text{regularity}}\right]. \tag{11}$$

However, the reconstruction results tend to be somewhat distorted
due to the inherent inaccuracies in the sketch and in the 2D image regularities.



4.2 3D Regularities and Quadric Face Regularities 

We introduce some geometrical constraints of 3D regularities and quadric face
regularities that are used in Equation (11) together with the 2D image regularities to
reconstruct 3D objects more accurately.

[Face parallelism]: A parallel pair of planes in the sketch plane reflects parallelism
in space. The term used to evaluate it is

$$\alpha_{\text{face parallelism}} = \sum_{i=1}^{n} \left[\cos^{-1}(\mathbf{n}_1 \cdot \mathbf{n}_2)\right]^2, \tag{12}$$

where $\mathbf{n}_1$ and $\mathbf{n}_2$ denote the normals of each possible pair of parallel faces, and $n$ is the
number of pairs of parallel faces.

[Face orthogonality]: An orthogonal pair of faces in the sketch plane reflects
orthogonality in space. The term used to evaluate it is

$$\alpha_{\text{face orthogonality}} = \sum_{i=1}^{n} \left[\sin^{-1}(\mathbf{n}_1 \cdot \mathbf{n}_2)\right]^2, \tag{13}$$

where $\mathbf{n}_1$ and $\mathbf{n}_2$ denote the normals of each possible pair of orthogonal faces, and $n$ is the
number of pairs of orthogonal faces.



It is simple to find parallel or orthogonal faces by using the angular distribution graph,
which identifies the prevailing axis system. First, we find the prevailing axis of each edge. Each
face then contains at most two prevailing axes. If two faces that each contain two axes have
the same axes, they are parallel faces; otherwise, they are orthogonal faces.
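
The pairing test just described can be written as a tiny sketch. The axis labels and the function name are hypothetical, and faces that do not contain exactly two prevailing axes are simply left unclassified here.

```python
def classify_face_pair(axes_a, axes_b):
    """Classify two faces from the prevailing axes found on their edges."""
    a, b = frozenset(axes_a), frozenset(axes_b)
    if len(a) != 2 or len(b) != 2:
        return None                      # rule applies only to faces with two axes
    return "parallel" if a == b else "orthogonal"

top_vs_bottom = classify_face_pair(("x", "y"), ("y", "x"))
top_vs_side = classify_face_pair(("x", "y"), ("y", "z"))
```

The resulting pairs are the ones that would feed the face parallelism and face orthogonality terms.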

In addition, we introduce a simple radius regularity affecting quadric faces.

[Radius equality]

$$\alpha_{\text{radius equality}} = \sum_{i=1}^{n} (d_1 - d_2)^2, \tag{14}$$

where $d_1$ and $d_2$ are the distances from the center of the curve to the end-vertices, and $n$ is the
number of quadric faces.
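
The three regularity terms can be evaluated with a short sketch. The printed formulas are partly illegible in the source, so the forms used here (squared inverse cosine for parallelism, squared inverse sine for orthogonality, squared distance difference for radius equality) are assumptions, as are the function names.

```python
import math

def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def _dot(a, b):
    # Clamp to [-1, 1] so rounding errors cannot break acos/asin.
    return max(-1.0, min(1.0, sum(x * y for x, y in zip(_unit(a), _unit(b)))))

def face_parallelism(normal_pairs):
    """Sum of squared angles between normals of faces expected to be parallel."""
    return sum(math.acos(_dot(a, b)) ** 2 for a, b in normal_pairs)

def face_orthogonality(normal_pairs):
    """Sum of squared deviations from a right angle for faces expected to be orthogonal."""
    return sum(math.asin(_dot(a, b)) ** 2 for a, b in normal_pairs)

def radius_equality(distance_pairs):
    """Sum of squared differences between center-to-end-vertex distances."""
    return sum((d1 - d2) ** 2 for d1, d2 in distance_pairs)

parallel_term = face_parallelism([((0, 0, 1), (0, 0, 2))])
orthogonal_term = face_orthogonality([((1, 0, 0), (0, 1, 0))])
radius_term = radius_equality([(2.0, 2.5)])
```

Each term is zero when the corresponding regularity holds exactly, so adding the terms to the compliance function penalizes configurations that violate them.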



In addition, we assign a high weight to the regularity of face planarity to obtain
the most plausible and realistic solution.
5 Experiments



5.1 Results



To estimate the efficiency of the proposed algorithm, we applied the method to various
objects as shown in Fig. 4. The experiments were run on a PC with a Pentium III
processor (450 MHz).




Fig. 4. Experimental Results (left: 2D sketch; center: 2D faces; right: 3D object) 
5.2 Discussions



Table 2 shows that our method efficiently narrows down the search space of face
identification to a manageable size. Moreover, the total time is dramatically reduced in most
cases by using the proposed method.

To evaluate the effect of the 3D regularities, we check the 3D error and the 2D error. We
define the 3D error as the distance between the depths of the reconstructed object's vertices
and the real depths of the synthetic object's vertices, and we define the 2D error as the sum of
the regularities proposed by Lipson et al. [5].

When the 3D regularities and quadric face regularities are used to improve the model
(after 20 iterations), they can perturb the error curve, as demonstrated by the sudden
spike (Fig. 5 and Fig. 6). However, they significantly improve the shape of the object
with more iterations.

Fig. 5 shows that, although the 2D error is not improved, the constraints of 3D
regularities significantly improve the shape of the reconstructed polyhedral object. Fig. 6 shows
the effect of the quadric face regularities. When the 3D regularities and quadric face
regularities are applied to a quadric object, they significantly reduce the 2D error as well as
the 3D error. However, the error for curved objects is still larger
than that for polyhedral objects because most of the regularities are derived
from the 2D planar configuration of a line drawing.

We apply the proposed method to real tower scenes to evaluate the efficiency of the
proposed algorithm (Fig. 7).



Table 2. Evaluation of face identification 



A: proposed method, B: [4]’s method 



Fig. 5. Evaluation of 3D regularities in the case of polyhedral object: Fig. 4(c)



[Plot: error versus iteration (2 to 40); legend "3D regularity" and "Quadric face regularity"; axis labels "2D Error" and "Error"]



Fig. 6. Evaluation of 3D regularities and quadric face regularities in the case of quadric object: 
Fig. 4(e) 



6 Conclusions 

We have presented an efficient algorithm for reconstructing a 3D object from a single
free-hand line drawing. We cut down the search domain by classifying potential faces
into implausible, basis and minimal faces. As a result, we can identify the actual faces of
an object quickly by searching the minimal faces only. In addition, we introduced 3D
regularities and quadric regularities to reconstruct various 3D objects more accurately.

Future work will focus on the regularities of curved faces and on the
reduction of reconstruction errors.

Fig. 7. Real tower scene: 3D reconstruction and navigation from one image
Abstract. Most CBIR systems use low-level visual features for
representation and retrieval of images. Generally such methods suffer from
high dimensionality, which leads to more computational time
and to inefficient indexing and retrieval. This paper focuses on
a low-dimensional color and shape based indexing technique for
achieving efficient and effective retrieval performance. We propose a combined
index using color and shape features. A new shape similarity measure
is proposed and shown to be more effective. Images are indexed by
dominant color regions, and similar images form an image cluster stored
in a hash structure. Each region within an image is further indexed by
a region-based shape index. The shape index is invariant to translation,
rotation and scaling. A Java-based query engine supporting
query-by-example is built to retrieve images by color and shape. The retrieval
performance is studied and compared with a region-based shape
indexing scheme.



1 Introduction 

The past few years have seen many advanced techniques evolving in
Content-Based Image Retrieval (CBIR). Applications in medicine, entertainment,
education, manufacturing, etc. make use of vast amounts of visual data in the form
of images. This calls for fast, effective, and efficient retrieval mechanisms.
A major approach directed towards achieving this goal is
the use of low-level visual features of the image data to segment, index and
retrieve relevant images from the image database. Recent CBIR systems based on
features like color, shape, texture, spatial layout, object motion, etc., are described in
the literature. Of all the visual features, color is the most dominant and distinguishing
one in almost all applications.

* This work is partly supported by the AICTE Young Teachers Career Award 
1.1 Previous Work in CBIR

Current CBIR systems such as IBM's QBIC allow automatic retrieval
based on simple characteristics and distributions of color, shape and texture.
But they do not consider structural and spatial relationships and, in general, fail to
capture the meaningful contents of the image. Also, the object identification
is semi-automatic. The Chabot project integrates a relational database with
retrieval by color analysis. Textual meta-data along with color histograms form
the main features used. VisualSEEK allows queries by color and spatial layout
of color regions. Text-based tools for annotating and searching images are
provided. A new image representation that uses the concept of localized coherent
regions in color and texture space is presented by Carson et al. Segmentation
based on the above features, called "Blobworld", is used, and the query is based on
these features.

Some of the popular methods to characterize color information in images are
color histograms, color moments and color correlograms. Though
all these methods provide a good characterization of color, they have the problem
of high dimensionality. This leads to more computational time and to inefficient
indexing and retrieval performance. To overcome these problems, the use of SVD, the dominant
color regions approach and color clustering have been proposed.

1.2 Recent Work in Shape-Based CBIR 

Shape is also an important feature for perceptual object recognition and
classification of images. It has been used in CBIR in conjunction with color and other
features for indexing and retrieval.

Shape description or representation is an important issue both in object
recognition and classification. Many techniques, including chain codes,
polygonal approximations, curvature, Fourier descriptors and moment descriptors, have
been proposed and used in various applications. Recently, techniques using
shape measures as an important feature have been used for CBIR. Features such
as moment invariants and area of region have been used, but they do not
give perceptual shape similarity. Cortelazzo used chain codes for trademark
image shape description and a string matching technique. The chain codes are not
normalized, and string matching is not invariant to shape scale. Jain and Vailaya
proposed a shape representation based on a histogram of edge
directions. But these are not scale normalized and are computationally expensive
in similarity measures. Mehrotra and Gary used the coordinates of significant
points on the boundary as the shape representation. It is not a compact
representation, and the similarity measure is computationally expensive. Jagadish
proposed decomposing a shape into a number of rectangles, with two pairs of
coordinates for each rectangle used to represent the shape. It is not rotation
invariant.

A region-based shape representation and indexing scheme that is translation,
rotation and scale invariant is proposed by Lu and Sajjanhar. It conforms to
human similarity perception. They compared it to the Fourier descriptor model
and found their method to be better. However, their image database consists only of
2D planar shapes, and they considered only binary images. Moreover, shapes
with similar eccentricity but different forms are retrieved as matched images.
Our aim is to extend this method to represent color image regions and to augment
the color index of our previous work with the shape features. Our shape
indexing feature and similarity measure are different and are shown to be effective in
retrieval. A combined index based on color and shape has been implemented to
improve retrieval efficiency and effectiveness.

The paper is organised as follows: Section 2 describes the color and shape
features used for indexing. The indexing scheme, querying and similarity measure
are explained in Section 3. Section 4 highlights the results of our approach and
sample output. A comparison of performance with the region-based scheme of Lu and Sajjanhar
is also covered in that section.

2 Color and Shape Features 

The initial step in our approach is to index images based on dominant color
regions. The image regions obtained after segmentation and indexing are used
as input to the shape module. The region-based shape representation proposed
by Lu and Sajjanhar is modified to calculate the shape features required for our proposed shape
indexing technique and similarity measure. It is simple to calculate and robust.
We show that the retrieval effectiveness is better than that of their method.

2.1 Color Indexing Approach 

To segment images based on dominant colors, a color quantization in RGB space
using 25 perceptual color categories is employed. From the
segmented image we find the enclosing minimum bounding rectangle (MBR)
of each region, its location, the image path, the number of regions in the image, etc.,
and all these are stored in a metafile for further use in the construction of an
image index tree. An index tree for the entire database is constructed when the
query engine is initiated. At query time, similar images matched on the basis
of color and spatial location are retrieved. To the above color index, we have
added a region-based shape index similar to that of Lu and Sajjanhar, which is invariant
to rotation, scaling and translation. Since their representation is suited only for
2-D binary image regions, we have used a different shape feature to index the
color regions and also a suitable shape similarity measure. A comparison of the
two techniques has been carried out for an image database consisting of flags,
flowers, fruits, vegetables and simulated shape regions.

2.2 Shape Representation 

Definitions of terminology: 

- Major axis: the straight line segment joining the two points on the boundary farthest away from each other (if there is more than one such pair, select any one).

- Minor axis: it is perpendicular to the major axis and of such length that a rectangle with sides parallel to the major and minor axes that just encloses the boundary can be formed using the lengths of the major and minor axes.

- Basic rectangle: the above rectangle, formed with the major and minor axes as its two sides.

- Eccentricity: the ratio of the length of the major axis to that of the minor axis.

- Centroid or center of gravity: the mean position of all points of the region. For 2D shapes, the coordinates $(X_c, Y_c)$ of the centroid are defined as:

$$X_c = \frac{\sum_x \sum_y f(x,y)\,x}{\sum_x \sum_y f(x,y)}, \qquad Y_c = \frac{\sum_x \sum_y f(x,y)\,y}{\sum_x \sum_y f(x,y)}$$

where $(x, y)$ are pixel coordinates and $f(x, y)$ is set to 1 for points within or on the shape and set to 0 elsewhere.
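The centroid definition can be illustrated with a minimal sketch. It assumes the region is given as a list of rows of 0/1 values (the function f above); the name `centroid` and the sample region are hypothetical and not part of the original system.

```python
def centroid(mask):
    """Return (Xc, Yc) for a binary mask given as a list of rows of 0/1."""
    total = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, f in enumerate(row):
            total += f        # denominator: sum of f(x, y)
            sum_x += f * x    # numerator of Xc
            sum_y += f * y    # numerator of Yc
    if total == 0:
        raise ValueError("an empty region has no centroid")
    return sum_x / total, sum_y / total


# Hypothetical 4x3 region, symmetric about its center.
region = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
print(centroid(region))  # (1.5, 1.0)
```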



Basic idea. Given a shape region, a grid space consisting of fixed-size square cells is placed over it so as to cover the entire shape region, as shown in the figure. We assign a "1" to cells with at least 25% of pixels covered and a "0" to each of the other cells. A binary sequence of 1's and 0's, read from left to right, is obtained as the shape feature representation. For example, the shape in the figure can be represented by the binary sequence 11111111 01111111 00110110 00000110 00000010 00000000.

The smaller the grid cells, the more accurate the shape representation, but the greater the storage and computation requirements. The representation is compact, easy to obtain and translation invariant. It is not, however, invariant to scale and rotation, so a scale and rotation normalization is carried out.



Rotation normalization. The purpose of rotation normalization is to place shape regions in a unique common orientation. Hence the shape region is rotated such that its major axis is parallel to the x-axis.

There are still two possible orientations, caused by 180° rotation, as shown in the figures. Further, two more orientations are possible due to the horizontal and vertical flips of the original region. Two binary sequences are needed for representing these two orientations, but only one sequence is stored; the other orientations are accounted for at the time of retrieval.



Scale normalization. To achieve scale normalization, we proportionally scale 
all the shape regions so that their major axes have the same length of 96 pixels. 



Shape index. Once the shape region has been normalized for scale and rotation 
invariance, using a fixed size of grid cells (say 8x8), we obtain a unique sequence 
for each shape region. The grid size in our proposed method is kept as 96 x 
96 pixels. Each sub-grid cell is of size 12x12 pixels giving a binary sequence of 
length 64 bits per shape region. Using this sequence, we find both the row and 
column totals of the 8x8 grid and store them as our shape index, which is more 
robust and gives a better perceptual representation to the coverage of the shape. 
A suitable shape similarity measure using this index is employed for matching 
images at query time. 
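As a minimal sketch of the shape index described above, the code below builds the binary grid from a normalized mask and derives the row and column totals. The function names (`grid_from_mask`, `parse_sequence`, `shape_index`) are hypothetical; the 12-pixel cell and 25% coverage threshold are taken from the text, and the input mask is assumed to be already normalized for rotation and scale.

```python
def grid_from_mask(mask, cell=12, min_cover=0.25):
    """Assign 1 to each cell with at least min_cover of its pixels set."""
    n_rows = len(mask) // cell
    n_cols = len(mask[0]) // cell
    grid = []
    for gy in range(n_rows):
        row = []
        for gx in range(n_cols):
            covered = sum(
                mask[y][x]
                for y in range(gy * cell, (gy + 1) * cell)
                for x in range(gx * cell, (gx + 1) * cell)
            )
            row.append(1 if covered >= min_cover * cell * cell else 0)
        grid.append(row)
    return grid


def parse_sequence(text):
    """Turn a space-separated binary sequence into a grid (list of rows)."""
    return [[int(ch) for ch in word] for word in text.split()]


def shape_index(grid):
    """Return (row totals, column totals) of a binary grid."""
    rows = [sum(r) for r in grid]
    cols = [sum(c) for c in zip(*grid)]
    return rows, cols


# The example sequence quoted earlier in this section.
example = "11111111 01111111 00110110 00000110 00000010 00000000"
rows, cols = shape_index(parse_sequence(example))
print(rows)  # [8, 7, 4, 2, 1, 0]
print(cols)  # [1, 2, 3, 3, 2, 4, 5, 2]
```

Storing the totals rather than the raw 64-bit sequence makes the index tolerant of small shifts of individual cells, which is consistent with the robustness claim in the text.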

3 Indexing Scheme and Querying

3.1 Color Index 

A composite index based on a color look-up table is formed consisting of 25 colors. The index is unique and given by the equation below:

$$\mathrm{Index} = \sum_{i=1}^{n} C_i \cdot 25^{\,n-i}$$

Suppose $(C_1, C_2, C_3)$ are the color indices of three dominant regions found within an image, where $C_1$ represents the index of the first dominant region, $C_2$ the index of the second dominant region and $C_3$ the index of the third dominant region.

Then, the index is given by

$$\mathrm{Index} = C_1 \cdot 25^{2} + C_2 \cdot 25^{1} + C_3 \cdot 25^{0}$$

Images with similar indices are stored in same hash entry of the hash table 
structure. Each entry also stores the color region features such as location, area, 
percentage of color, etc., associated to each region of the image which is used in 
the matching criteria. 
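The composite index amounts to reading the dominant color indices as digits of a base-25 number. The sketch below assumes color indices in the range 0 to 24 and uses a plain dictionary as the hash table; the names `color_index` and `buckets`, and the image names, are hypothetical.

```python
def color_index(dominant, n_colors=25):
    """Combine dominant color indices (most dominant first) into one key."""
    index = 0
    for c in dominant:
        if not 0 <= c < n_colors:
            raise ValueError("color index out of range")
        index = index * n_colors + c  # Horner form of sum C_i * 25^(n-i)
    return index


# Images with the same dominant colors land in the same hash entry.
buckets = {}
for name, colors in [("flag_a", (3, 10, 24)), ("flag_b", (3, 10, 24)),
                     ("rose", (7, 0, 1))]:
    buckets.setdefault(color_index(colors), []).append(name)

print(color_index((3, 10, 24)))  # 3*625 + 10*25 + 24 = 2149
print(buckets[2149])             # ['flag_a', 'flag_b']
```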



3.2 Shape Index 

For each color region processed above, we compute the shape descriptor as fol- 
lows: 



1. Compute the major and minor axes of each color region.

2. Rotate the shape region to align the major axis with the x-axis to achieve rotation normalization, and scale it such that the major axis has the standard fixed length (96 pixels).

3. Place the grid of fixed size (96x96 pixels) over the normalized color region and obtain the binary sequence by assigning 1's and 0's accordingly.

4. Using the binary sequence, compute the row and column total vectors. These, along with the eccentricity, form the shape index for the region.

3.3 Querying 

Given a query image, we apply the same process to it to obtain the color and shape features. Our implementation supports both query-by-example and query-by-feature for color matching. The shape matching module supports only query-by-example. Based on the color index of the query image, a list of matching images is retrieved from the hash structure. The shape descriptors are then used to find matching images within this initial set, giving the final images matched on both color and shape.

The query process is as follows: 

1. The query image is processed to obtain a list of matching images based only on color features.

2. For each color region in the query image, the shape representation of the region is evaluated. To take care of the problem of 180° rotation and vertical and horizontal flips, we need to store 4 sets of the shape index.

3. Compare the shape index of regions in the query image to those in the list of images retrieved on color.

4. Only regions whose eccentricity matches within a threshold (t) are compared for shape similarity.

5. The matching images are ordered by the sum of the differences between the row and column vectors of the query and the matching image.
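The four sets of the shape index mentioned in step 2 can be derived from a single stored index, since a flip or a 180° rotation only reverses the order of the row or column totals. The following is a minimal sketch under that reading; the function name `orientation_variants` is hypothetical.

```python
def orientation_variants(rows, cols):
    """Return the shape index under the four orientations of a grid."""
    return [
        (rows, cols),              # original orientation
        (rows, cols[::-1]),        # horizontal flip: column order reversed
        (rows[::-1], cols),        # vertical flip: row order reversed
        (rows[::-1], cols[::-1]),  # 180 degree rotation: both reversed
    ]


variants = orientation_variants([8, 7, 4], [1, 2, 3])
for r, c in variants:
    print(r, c)
```

Comparing a query against all four variants and keeping the smallest difference is one way to account for the remaining orientation ambiguity at retrieval time.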

3.4 Similarity Measure 

Let $R$ and $R'$ represent the row vectors of the test image and the query image respectively. Similarly, $C$ and $C'$ represent the column vectors of the test image and the query image respectively. The similarity measure is computed as follows:

1. Calculate the row and column vectors of all the regions in the query image.

2. Find the row and column differences between query image regions and regions in the image to be tested using the equations:

$$R_d = \sum_i |R_i - R'_i|, \qquad C_d = \sum_i |C_i - C'_i|$$

where $R_d$ and $C_d$ are the row and column differences between the test image region and the query image region, $R_i$ and $C_i$ are the $i$-th entries of the row and column vectors in the test image, and $R'_i$ and $C'_i$ are the $i$-th entries of the row and column vectors in the query image.

3. If $(R_d + C_d) < T$ (threshold), then the images match.
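A minimal sketch of this similarity measure follows. The names `vector_difference`, `shape_distance` and `shapes_match`, the sample vectors and the threshold values are hypothetical; the source does not state the value of T.

```python
def vector_difference(a, b):
    """Sum of absolute element-wise differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shape_distance(test_index, query_index):
    """Return Rd + Cd for two (row totals, column totals) shape indices."""
    (rows_t, cols_t), (rows_q, cols_q) = test_index, query_index
    return vector_difference(rows_t, rows_q) + vector_difference(cols_t, cols_q)


def shapes_match(test_index, query_index, threshold):
    """Regions match when Rd + Cd is strictly below the threshold T."""
    return shape_distance(test_index, query_index) < threshold


query = ([8, 7, 4, 2, 1, 0], [1, 2, 3, 3, 2, 4, 5, 2])
test = ([8, 6, 4, 3, 1, 0], [1, 2, 3, 3, 2, 4, 4, 3])
print(shape_distance(test, query))             # Rd = 2, Cd = 2, total 4
print(shapes_match(test, query, threshold=6))  # True
```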
4 Experimental Results and Performance

The experimental database consists of about 200 images of flags and 120 images of fruits, flowers and simulated objects (squares, rectangles, triangles, circles, etc.). Each image in the database is indexed on color and shape features. A hash table stores images of similar index based on the features extracted. Images are retrieved first based on the color index and displayed. Then the shapes of all regions in the query image are compared to the region shapes in the displayed set to find images similar on the basis of the shape index.

An example output for retrieval from the image database of flowers, fruits and simulated shape regions is shown in figure 5 for matching on color and in figure 6 for matching on shape. It can be observed that images dissimilar in shape are eliminated. The image on the left of the screen is the query image. Figures 7 and 8 show the corresponding results for the image database of flags.




Fig. 5. Retrieval of images based on matching color regions.

Fig. 6. Retrieval of images based on matching shape regions.

Fig. 7. Retrieval of images based on matching color regions.

Fig. 8. Retrieval of images based on matching shape regions.

Fig. 9. Retrieval results based on the similarity measure of Lu and Sajjanhar.

Fig. 10. Retrieval results based on our proposed similarity measure.

Fig. 11. Retrieval results based on the similarity measure of Lu and Sajjanhar.

Fig. 12. Retrieval results based on our proposed similarity measure.



We have compared the results of our technique with that proposed by Lu and Sajjanhar. The outputs of the two techniques are shown in figures 9 and 10 for the image database of flowers, fruits and simulated shape regions. Figures 11 and 12 show the corresponding difference in retrieval results for the flag image database. The outputs show that there is better pruning of the matched images when the row and column vector based technique is used for matching.

The retrieval performance is measured using recall and precision, as is standard for CBIR systems. Recall measures the ability to retrieve all relevant or similar items in the database. It is defined as the ratio between the number of relevant or perceptually similar items retrieved and the total number of relevant items in the database. Precision measures the retrieval accuracy and is defined as the ratio between the number of relevant or perceptually similar items retrieved and the total number of items retrieved.
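The two measures can be computed as in the short sketch below, which assumes retrieved and relevant items are identified by hashable labels; the function name `recall_precision` and the sample image names are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 8 relevant images exist, 5 are retrieved, 4 are correct.
relevant = {"img%d" % i for i in range(8)}
retrieved = ["img0", "img1", "img2", "img3", "other"]
print(recall_precision(retrieved, relevant))  # (0.5, 0.8)
```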

The graph in figure 13 shows the retrieval performance for the image database of flowers, fruits and simulated shapes in terms of color and shape curves. A similar analysis for the flag image database is shown in figure 14.

A comparative study of the two techniques of indexing and retrieval is shown in figures 15 and 16 respectively. It is seen that our indexing method and similarity measure provide better retrieval effectiveness.

Fig. 13. Recall-precision graph for the image database of flowers, etc.

Fig. 14. Recall-precision graph for the image database of flags.

Fig. 15. Recall-precision graph comparing the two similarity measures for the image database of flowers, etc.

Fig. 16. Recall-precision graph comparing the two similarity measures for the image database of flags.



5 Conclusions 

A combined color and shape based low-dimensional indexing technique has been implemented. Images are segmented into dominant regions based on perceptually similar colors using a color quantized indexing method. The segmented regions are stored in a hash structure as similar image clusters. Shape features of these regions are used to further prune the retrieval of images from a sample image database. The shape representation is based on a grid-based coverage of the region, which is normalized to achieve invariance to translation, rotation and scale. Our proposed index, based on row and column vectors, and the related similarity measure are shown to provide efficient and effective retrieval performance. A Java based search engine using query-by-example has been developed on the Windows NT platform. The method can be further enhanced by including texture features.
Abstract. In this paper we propose a virtual drilling algorithm which is applied to 3-D objects. We consider that initially we are provided with a sparse set of parallel and equidistant slices of a 3-D object. We propose a volumetric interpolation algorithm for recovering the 3-D shape from the given set of slices. This algorithm employs a morphological morphing transform. Drilling is simulated on the resulting volume as a 3-D erosion operation. The proposed technique is applied to the virtual drilling of teeth, considering various burr shapes as erosion elements.



1 Introduction 

3-D object representation and processing simulation is required in many fields such as medicine, architecture, computer aided design, etc. Very often
we are not provided with the complete information about the object to be mod- 
eled and processed. In this paper we show how virtual processing operations can 
be simulated on volumes described by means of a group of slices representing 
parallel sections of its structure. Depending on the type of the object we rep- 
resent such images can be acquired by Computer Tomography (CT), Magnetic 
Resonance Imaging (MRI) or by mechanical slicing and digitization. Usually, 
the pixel size within a slice is different from the spacing between two adjacent 
slices. In such situations it is necessary to interpolate additional slices in order to 
obtain an accurate volumetric description of the object. In this paper we employ 
mathematical morphology for reconstructing a full 3-D shape from a group of 
slices and afterwards for modeling 3-D drilling in the resulting shape. 

There are two main categories of interpolation techniques for reconstructing objects from sparse sets: grey-level and shape-based. Grey-level interpolation methods employ nearest-neighbor, spline, linear, or polynomial interpolation. Other algorithms employ feature matching or homogeneity similarity for determining the direction of interpolation. Interpolation of additional cross-sections from shape contours in vector form has also been described. In another approach, a distance function from each pixel to the object boundary is considered for interpolation, and extensions of this algorithm have been proposed. A further algorithm uses elastic matching interpolation, spline theory and surface consistency. In a morphological approach, each slice is eroded by a morphological operator until its number of pixels reaches the mean of those from the slices to be interpolated. A mixed shape and grey-level based interpolation method has also been proposed.

A mathematical morphology based function interpolation algorithm called the skeleton by influence zones transform (SKIZ) has also been employed. The SKIZ transform interpolates by employing dilations of the intersection and of the complement of the union of two sets. However, such an approach does not correspond to a natural morphing of one set into the next one. In this paper we propose a morphing procedure for estimating the intermediary slices of the two given sets. The morphing transforms two neighboring sets by using combinations of dilations and erosions. The interpolated set corresponds to the idempotency of the two morphed sets after a certain number of iterations. This produces a new set sharing similarities in shape with both initial sets. The morphing transformation is applied repeatedly onto the new stack of interpolated sets until we recover an appropriate object shape.

After reconstructing the 3-D volumes we simulate a drilling operation. Virtual surgery using 3-D data visualization has lately attracted a lot of attention due to its potential use in surgical intervention planning and training. We propose a morphological algorithm using 3-D structural elements for simulating drilling. Virtual drilling is modeled as a succession of volumetric erosions oriented along the chosen direction. The paper is organized as follows. Section 2 describes the interpolation algorithm and Section 3 the simulation of drilling using a 3-D operator. In Section 4 we provide experimental results when applying the proposed virtual drilling tool on a set of teeth reconstructed by interpolation, while the conclusions of this study are drawn in Section 5.



2 Geometrically Constrained 3-D Interpolation 

Let us assume that we have two sets $X_i$ and $X_{i+1}$ which share at least one common point, $X_i \cap X_{i+1} \neq \emptyset$. We align these sets according to an $(n-1)$-dimensional hyperplane (an axis for 2-D sets) using matching or a centering operation. Let us consider $x_{i,m}$ an element (pixel in 2-D or voxel in 3-D) contained in the set $X_i$, where $m$ denotes an ordering number, and denote the complement (background) of the set $X_i$ by $X_i^c = E - X_i$. After alignment, each element $x_{i,m}$ in one set will have a corresponding element which may be a member of the other set, $x_{i+1,m} \in X_{i+1}$, or may be part of its background, $x_{i+1,m} \in X_{i+1}^c$.

Our morphing transformation ensures a smooth transition from one shape set to the other by generating several intermediary sets. Let us consider the elements located on the boundary set, denoted by $C_i$:

$$C_i = \{\, x_{i,m} \in X_i \mid \exists\, x_{i,k} \in N(x_{i,m}),\ x_{i,k} \in X_i^c \,\} \qquad (1)$$

where $N(x_{i,m})$ denotes the neighborhood of the location $x_{i,m}$, having the same size and shape as the structuring element $B_1$. In our morphing operation, the elements of a boundary set $C_i$ are changed differently according to their correspondences from the second given set $X_{i+1}$. These changes are defined in terms of basic mathematical morphology operations such as dilations and erosions. The dilation of a set $A$ by the structuring element $B$ is given by:



$$A \oplus B = \bigcup_{b \in B} A_b \qquad (2)$$

where $\oplus$ denotes dilation and $A_b$ represents a structuring element centered onto an element of the set $A$. The erosion of a set $A$ by the structuring element $B$ is given by:

$$A \ominus B = \bigcap_{b \in B} A_b \qquad (3)$$

where $\ominus$ denotes erosion. These operations correspond to the Minkowski set addition and subtraction. The most commonly used structuring element is the elementary ball of dimension $n$. The dilation with the elementary ball, i.e. for $n = 1$, expands the given set with a uniform layer of elements, while the erosion operator takes out such a layer from the given set. The structuring element considered in this paper consists of a pixel and its horizontal and vertical immediate neighbors. We can identify three possible correspondence cases for the elements of the two aligned sets. One situation occurs when the border region of one set corresponds to the interior of the other set. In this case we dilate the border elements:

$$\text{If } x_{i,m} \in C_i \,\wedge\, x_{i+1,m} \in X_{i+1} \text{ then perform } x_{i,m} \oplus B_1 \qquad (4)$$

where $B_1$ is the structuring element for the set $X_i$. A second case occurs when the border region of one set corresponds to the background of the other set. In this situation we have erosions of the boundary elements:

$$\text{If } x_{i,m} \in C_i \,\wedge\, x_{i+1,m} \in X_{i+1}^c \text{ then perform } x_{i,m} \ominus B_1 \qquad (5)$$

No modifications are performed when both corresponding elements are members of their sets' boundaries:

$$\text{If } x_{i,m} \in C_i \,\wedge\, x_{i+1,m} \in C_{i+1} \text{ then perform no change} \qquad (6)$$

The last situation corresponds to the regions where the two sets coincide locally and no change is necessary, while (4) and (5) correspond to morphing transformations.

By including all these local changes we define the morphing transformation $f(X_i \mid X_{i+1}, B_1)$, applied to the set $X_i$ and depending on the set $X_{i+1}$ and on the structuring element $B_1$. A similar morphing operation $f(X_{i+1} \mid X_i, B_2)$ is defined on the set $X_{i+1}$, depending on the set $X_i$ and on the structuring element $B_2$. The proposed morphing transformation is illustrated in Figure 1.
Fig. 1. Exemplification of mathematical morphology morphing. The result produced by $f(X_{i+1} \mid X_i, B_2)$ is represented with dashed lines while the result produced by $f(X_i \mid X_{i+1}, B_1)$ is represented with dot-dashed lines: (a) $(X_i \ominus B_1) \supset (X_{i+1} \oplus B_2)$; (b) $X_i - X_{i+1} \neq \emptyset$ and $X_{i+1} - X_i \neq \emptyset$.



The morphing operation defined by $f(X_i \mid X_{i+1}, B_1)$ and by $f(X_{i+1} \mid X_i, B_2)$ is applied iteratively onto the sets resulting from the previous morphings. For isotropic interpolation we use identical structuring elements, $B_1 = B_2 = B$, when morphing the two sets. The succession of morphing operations creates new sets starting from the two initial extremes. With each iteration these sets are closer in shape and size to each other. Eventually, the morphological transformations processing each slice will lead to the idempotency of the resulting sets. This set will represent the resulting interpolation.
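The iteration can be sketched as follows. This is an illustrative reading of the three correspondence cases applied element by element, not the authors' implementation: sets are stored as Python sets of integer coordinate tuples, the structuring element is a point and its immediate axis-aligned neighbors, the dilation case is applied only when the corresponding element lies in the interior of the other set, and both sets are updated simultaneously. All function names are hypothetical, and convergence is only checked, not guaranteed.

```python
def neighbors(p):
    """Immediate axis-aligned neighbors of a point given as a tuple of ints."""
    result = []
    for axis in range(len(p)):
        for step in (-1, 1):
            q = list(p)
            q[axis] += step
            result.append(tuple(q))
    return result


def boundary(s):
    """Elements of s having at least one neighbor in the background."""
    return {p for p in s if any(q not in s for q in neighbors(p))}


def morph_step(x, other):
    """Apply the erosion and dilation cases to the boundary elements of x."""
    bx, bo = boundary(x), boundary(other)
    out = set(x)
    for p in bx:
        if p not in other:
            out.discard(p)            # corresponds to background: erode
        elif p not in bo:
            out.update(neighbors(p))  # corresponds to interior: dilate
        # both on their boundaries: no change
    return out


def interpolate(a, b, max_iter=1000):
    """Morph both sets toward each other until they coincide."""
    a, b = set(a), set(b)
    for _ in range(max_iter):
        if a == b:
            return a
        a, b = morph_step(a, b), morph_step(b, a)
    raise RuntimeError("sets did not converge within max_iter iterations")


# 1-D example: intervals [0, 3] and [2, 7] morph into [1, 5].
a = {(x,) for x in range(0, 4)}
b = {(x,) for x in range(2, 8)}
print(sorted(interpolate(a, b)))  # [(1,), (2,), (3,), (4,), (5,)]
```

In this small example the result lies midway between the two input intervals, which matches the intent of producing a set that shares similarities in shape with both initial sets.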

This procedure can be easily extended to gray scale objects. In our approach we employ bilinear gray-level interpolation in the overlapping area of the two sets $(X_i \cap X_{i+1})$. In the regions where only one of the sets is defined (i.e. $X_i - X_{i+1} \neq \emptyset$ and $X_{i+1} - X_i \neq \emptyset$) we replicate the gray level values of the existing set for the interpolated slice.

3 Simulating Drilling by Volumetric Erosion 

Shape-based interpolation provides us with the 3-D object reconstruction. We need the reconstructed volume in order to simulate virtual processing operations. In the following we propose a mathematical morphology based system for simulating drilling. Let us consider that the volume is made up of isotropic material and that the effects of drilling do not depend on the direction. The drilling of the 3-D volume proceeds along a certain direction. Let us consider a parametric spherical coordinate system in which the direction is given by two angles $(\theta, \phi)$, where $\theta$ represents the angle made by the drilling direction with the image plane $(x, y)$ and $\phi$ represents the angle made by the projection of the drilling direction on the image plane with the horizontal axis $x$. The drilling of a volume $O$ produces a drilled volume, denoted as $\tilde{O}$, and can be represented as a succession of erosions with a volumetric structuring element denoted as $B^{(\theta,\phi)}$:

$$\tilde{O} = O(x, y, z) \ominus B^{(\theta,\phi)} \qquad (8)$$

where $(x, y, z)$ denotes the point where the drilling is applied. The eroded voxels are assigned the value of the background, while the object voxels corresponding to the surface of the structuring element become part of the boundary of the new object $\tilde{O}$. Drilling is simulated by successively eroding volumes of the 3-D structuring element size in the given direction. The drilling direction is given by the following parametric equations:

$$x(i) = x(i-1) - d \cos(\theta) \cos(\phi) \qquad (9)$$

$$y(i) = y(i-1) - d \cos(\theta) \sin(\phi) \qquad (10)$$

$$z(i) = z(i-1) - d \sin(\theta) \qquad (11)$$

where $d$ is the width of the drilling element in the direction of the drilling, and where the starting drilling point has the coordinates $(x(0), y(0), z(0))$. The number of times the 3-D erosion is applied depends on the speed and duration of the drilling.
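The parametric equations (9)-(11) describe a sequence of erosion centers along the drilling direction. A minimal sketch, assuming angles in radians and a hypothetical function name `drilling_path`:

```python
import math


def drilling_path(start, theta, phi, d, steps):
    """Return the successive erosion centers, including the start point."""
    x, y, z = start
    path = [(x, y, z)]
    for _ in range(steps):
        x -= d * math.cos(theta) * math.cos(phi)  # equation (9)
        y -= d * math.cos(theta) * math.sin(phi)  # equation (10)
        z -= d * math.sin(theta)                  # equation (11)
        path.append((x, y, z))
    return path


# Drilling within the image plane along the x axis (theta = 0, phi = 0).
print(drilling_path((10, 10, 10), 0.0, 0.0, 2, 2))

# Drilling perpendicular to the image plane (theta = pi / 2): only z changes,
# up to floating-point rounding in x and y.
print(drilling_path((10, 10, 10), math.pi / 2, 0.0, 2, 3)[-1])
```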

We can significantly speed up the erosion process by changing the reference system from one centered on the user position to one in which the processed 3-D object is the reference. In this case we replace the heavy burden of calculating the directional erosion with rotating the entire volume such that the direction of drilling becomes parallel to the z axis. We consider three shapes for modeling the volumetric erosion element: spherical, cylindrical and conical. At each erosion we extract a volume with the shape given by the corresponding structural element. The corresponding region of the volume $O$ is assigned the same grey level as its background, denoted as $O^c$. In this case the erosion with the spherical element at iteration $i$ is modeled by:

$$\{(x, y, z) \in O,\ (x - x_{iM})^2 + (y - y_{iM})^2 + (z - z(i))^2 < d^2\} \in O^c \qquad (12)$$

where $d$ is the radius of the spherical erosion element and $(x_{iM}, y_{iM})$ are the coordinates of the projection on the image plane of the 3-D object point where the erosion takes place. The direction of erosion is considered perpendicular to the image plane in this case. When employing the cylindrical erosion element, the local drilling effect is given by:

$$\{(x, y, z) \in O,\ z(i) > z > z(i) - d,\ (x - x_{iM})^2 + (y - y_{iM})^2 < R^2\} \in O^c \qquad (13)$$

where $R$ is the radius and $d$ is the height of the cylindrical erosion element. The depth is conventionally considered as a negative number. For the conical erosion element, the local drilling effect is simulated by:

$$\{(x, y, z) \in O,\ z(i) > z > z(i) - d,\ (x - x_{iM})^2 + (y - y_{iM})^2 < \rho(z)^2\} \in O^c \qquad (14)$$

where $\rho(z)$ is the radius of the cone cross-section at height $z$, which varies linearly between 0 and $R$ over the height $d$, and $R$ and $d$ are the conical erosion element radius and height, respectively. In this case the drilling direction is identical with that of the projection ray used for the volumetric visualization (we have used a parallel ray tracing algorithm). The changes in the volume rendering are localized only in an area around $(x_{iM}, y_{iM})$ depending on the drilling tool size. This contributes to a significant reduction in computational complexity.
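A single erosion with the spherical or cylindrical element can be sketched as below. This is only an illustration under simple assumptions: the volume is a set of integer voxel coordinates, the drilling direction is already parallel to the z axis, and the removed voxels are simply dropped from the set instead of being relabelled with the background grey level. The function names are hypothetical, and the conical element is omitted.

```python
def spherical_element(center, d):
    """Membership test for a sphere of radius d, as in equation (12)."""
    cx, cy, cz = center

    def inside(v):
        x, y, z = v
        return (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 < d * d

    return inside


def cylindrical_element(center, radius, d):
    """Membership test for a cylinder of given radius and height d, eq. (13)."""
    cx, cy, cz = center

    def inside(v):
        x, y, z = v
        return cz > z > cz - d and (x - cx) ** 2 + (y - cy) ** 2 < radius * radius

    return inside


def erode(volume, inside):
    """Remove from the volume every voxel covered by the erosion element."""
    return {v for v in volume if not inside(v)}


# Hypothetical 5x5x5 solid cube of voxels.
cube = {(x, y, z) for x in range(5) for y in range(5) for z in range(5)}
after_sphere = erode(cube, spherical_element((2, 2, 2), 1.5))
after_cylinder = erode(cube, cylindrical_element((2, 2, 5), 1.5, 3))
print(len(cube) - len(after_sphere))    # 19 voxels removed
print(len(cube) - len(after_cylinder))  # 18 voxels removed
```

Because each test touches only voxels near the element center, restricting the loop to a bounding box around the center would reflect the locality argument made in the text.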
Fig. 2. Segmented and aligned set of slices for an incisor.



4 Simulation Results 

We have applied this algorithm in virtual dentistry. We employ the interpolation algorithm to reconstruct teeth. Afterwards, we apply the virtual drilling algorithm to the reconstructed 3-D teeth. In our simulations we consider three different types of teeth: an incisor (single root), a premolar (two roots) and a molar (three roots). Teeth from each of these categories have been mechanically sliced and digitized. The teeth boundaries as well as the root canals are segmented in each slice and the resulting object slices are aligned using a semi-automatic procedure. Initial slices after alignment are displayed in Figure 2 for an incisor. We have used the morphological interpolation algorithm described in Section 2 in order to reconstruct the tooth from the given initial group of slices. Tooth cross-sections are interpolated between each two consecutive slices. In the case of the incisor, the interpolation algorithm is applied recursively four times. Thus we obtain 336 interpolated slices from 22 original slices. The 3-D reconstructions from two different viewing angles are shown in Figures 3a and 3b for the incisor, in Figures 3c and 3d for the premolar and in Figures 3e and 3f for the molar, respectively. This result shows a smooth transition, interpolating well even between slices having large geometrical shape variations. The morphology of the reconstructed tooth is accurate despite the fact that most of the slices have been reconstructed by interpolation.

Fig. 3. Two different 3-D views for each of the three teeth: (a), (b) incisor; (c), (d) premolar; (e), (f) molar.

We have compared the mathematical morphology interpolation algorithm with a linear interpolation algorithm. The linear interpolation algorithm takes the midpoints of the line segments between pixels on the object contours of the two slices, in both the x (horizontal) and y (vertical) directions, as the interpolated slice contour. We have applied the linear interpolation algorithm on the incisor sequence displayed in Figure 2. For assessing the performance of the interpolation algorithms we have devised the following objective measure. Let $X_i$, $X_{i+1}$ and $X_{i+2}$ be three original tooth slices and $\hat{X}_{i+1}$ be the result of interpolating $X_i$ and $X_{i+2}$. Let $|X|$ denote set cardinality. The ratio $|XOR(\hat{X}_{i+1}, X_{i+1})| / |X_{i+1}|$, representing the percentage of wrongly estimated pixels, can be used as a performance measure. In Table 1 we provide the results for reconstructing three different slices from the incisor group of sets as well as the average result for reconstructing any intermediary slice $\hat{X}_{i+1}$ from the given group of sets $X_i$, $X_{i+2}$, for any $i \in \{1, \ldots, N-2\}$, where $N$ is the number of initial sets. We can observe from Table 1 that the interpolated slice obtained by morphing is closer to the original slice than that obtained by linear interpolation. The 3-D molar reconstructed by linear interpolation is displayed in Figure 4a, while in Figure 4b we show the same molar reconstructed by morphological morphing as described in this paper.
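The objective measure can be computed directly on pixel sets, as in the following sketch. It assumes each slice is represented as a set of (x, y) pixel coordinates; the function name `xor_error` and the sample slices are hypothetical.

```python
def xor_error(original, estimate):
    """Percentage of wrongly estimated pixels: |XOR| / |original| * 100."""
    original, estimate = set(original), set(estimate)
    if not original:
        raise ValueError("the original slice must not be empty")
    return 100.0 * len(original ^ estimate) / len(original)


# Hypothetical slices: the estimate misses one pixel and adds one extra.
original = {(x, 0) for x in range(10)}
estimate = {(x, 0) for x in range(1, 11)}
print(xor_error(original, estimate))  # 20.0
```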



Table 1. Objective comparison measure between morphological morphing and linear interpolation when reconstructing an incisor. The frame difference is $|XOR(X_{i+2}, X_i)| / |X_i|$; the errors of both interpolation methods are $|XOR(\hat{X}_{i+1}, X_{i+1})| / |X_{i+1}|$.

Frames          Frame difference (%)   Morphological morphing (%)   Linear interpolation (%)
4, 5, 6         62.9                   5.9                          11.925
10, 11, 12      26.8                   6.84                         9.46
18, 19, 20      27.2                   7.5                          14.28
Entire volume   51.5                   9.25                         11.46
Fig. 4. Reconstruction of a molar in 3-D: (a) linear interpolation; (b) morphological morphing interpolation.



After rendering and displaying the volume of a tooth, we simulate drilling 
by using a volumetric erosion element for the dental burr. We have chosen three 
different geometrical shapes for modeling the dental burr: spherical, conical and 
cylindrical. We associate the action of a drilling tool (dental burr) with the 
repetition of erosions done with various 3-D structural elements. 

The 3-D structural elements employed in our experiments for simulating different types of dental burrs are shown in Figures 5a, 5b and 5c. The elementary drilling operation consists of eroding the 3-D object with the structuring element corresponding to the shape of the drilling burr. Effects of drilling on a tooth when using spherical, cylindrical and conical erosion elements are displayed in Figures 5d, 5e and 5f respectively. We have applied the proposed morphological drilling tool for virtual dentistry by testing the drilling algorithm on teeth reconstructed by 3-D interpolation. Dentists have used an entire set of virtual drilling burrs with several different geometric parameter values. A 3-D tooth after being drilled by a dentist is displayed in a further figure, together with a set of sections through the drilled tooth. Two dental operations that have a particular treatment significance can be observed in these figures.

5 Conclusions 

We simulate a processing operation such as drilling on a 3-D volume given a sparse set of slices which represent parallel and equidistant object sections. We have proposed an algorithm for reconstructing the 3-D object from the given group of slices. This algorithm interpolates between each two adjacent slices by means of morphological shape-based interpolation. Virtual drilling is simulated by a succession of erosions. The proposed algorithms are applied to teeth that have been mechanically sliced and digitized. The morphological drilling algorithm can be used as a tool in a virtual sculpting environment.
Fig. 5. Shapes of volumetric erosion elements employed as burrs for tooth drilling: (a) spherical; (b) cylindrical; (c) conical; (d), (e), (f) volumetric erosion results on an incisor for the given set of tools, assuming only one erosion.
Abstract. This work describes a system for detecting and classifying malaria parasites in images of Giemsa stained blood slides in order to evaluate the parasitaemia of the blood. The first aim of our system is to detect the parasites by means of an automatic thresholding based on a morphological approach. Then we propose a morphological method for cell image segmentation, based on grey scale granulometries and openings with flat and hemispherical disk-shaped elements, that is more accurate than the classical watershed-based algorithm. The last step of the system is classifying the parasites by morphological skeleton.



1 Introduction 

In malarial blood the red corpuscles of vertebrates are infected by malaria 
parasites. The parasite develops in a highly regulated manner through distinct cycles 
in the vertebrate host. The parasite attacks red corpuscles, in which it first 
appears as a minute speck of chromatin surrounded by scanty protoplasm; it 
gradually becomes ring-shaped and is then known as a ring or immature trophozoite. 
It grows at the expense of the red cell and assumes a form differing widely 
with the species but usually exhibiting active pseudopodia, i.e. projections of 
the nuclei. Pigment granules appear early in the growth phase and the parasite 
is then known as a mature trophozoite. As the nucleus begins to divide and take up 
peripheral positions, the parasite is known as a schizont. The infected red blood 
cell ruptures. Some parasites, on entering red cells, become round sexual 
gametocytes instead of asexual schizonts. 

The aim of our system is to detect the parasites using a scan of a colour 
photograph of stained malarial rodent blood from a microscope in order to evaluate 
the parasitaemia of the blood, i.e. to count the number of parasites per number 
of red blood cells. A manual analysis of slides is tiring, time-consuming and 
requires a trained operator, so our task is to automate the process. The image 
processing system is made of three main steps: detection of parasites, cell 
segmentation and classification of parasites. In Section 2 we describe the different 
phases of the image analysis, beginning with parasite detection. We propose 
a method to automatically separate the parasites from the rest of an infected 
blood image using colour and size information. Then we introduce an efficient 
morphological method to segment cell images, improving on the accuracy of 
the classical watershed-based algorithm. The last aim of the analysis is 
classifying the detected parasites using their morphological skeletons. In Section 3 
the experimental results are presented, providing a comparison with the classical 
watershed-based algorithm and some numerical results. Finally, Section 4 draws 
the conclusions. 



2 A Morphological Approach to Malarial Blood Analysis 

Mathematical morphology is well suited for biological and medical image anal- 
ysis. In fact it offers a powerful tool for extracting image components that are 
useful in the representation and description of region shape, size and colour. Our 
proposed technique is based on morphological methods, using granulometries 
to evaluate the size of the red cells and the nuclei of parasites and the regional 
maxima to mark relevant bright image objects, i.e. to detect the nuclei of 
parasites. We are also interested in morphological techniques for pre- or 
post-processing, such as morphological filters to suppress or smooth some areas, 
thinning and gradient to improve cell contours, and reconstruction by dilation 
to recover objects of interest (schizonts, infected red cells). 

At the moment the processed images are taken from a microscope, developed 
on photographic paper and then scanned. This process will in future 
be automated. Our source images are taken at a range of magnifications, with 
some variation in stain colour and lighting conditions (see the sample 465x702 
pixel image in Figure 1). We analyse the hue-saturation-value colour space. In 
both the hue component, H, and the saturation one, S, the bright objects are 
of interest. This is because the Giemsa staining solution stains nucleic acids, 
i.e. highlights the DNA in these objects, showing it as a dark, saturated purple. 
Therefore the white cells and the parasites, which contain DNA, are much 
brighter than the other objects in the H and S components. 





Fig. 1. A sample malarial blood image 

Fig. 2. The pattern spectrum of the filtered and closed saturation image 
2.1 Automatic Thresholding Using Granulometry and Regional 
Maxima 

To detect the bright objects we convert the H and S images into binary images 
using simple thresholding; the product of these binary images is then 
the marker image for the parasites and the white blood cells. So, our aim is to 
automate the selection of these two thresholds in order to separate the objects 
of interest, i.e. those containing DNA, from the rest of the image. The images 
to process contain noise from the sample, from the microscope light, from the 
chemical development process or from the scanner. The noise is smoothed by 
median filtering with a 5x5 window on both the H and S images. In order to 
enhance the bright image objects and make the image background flatter, darker 
and cleaner, we apply a morphological area closing to both the H and S 
components. This is especially useful for better estimating the cell size from the 
pattern spectrum. 

Bright image objects can be detected by looking for the regional maxima in 
the image, in other words the parasites and the white blood cells correspond 
to the regional maxima of the H and S images. As all the objects of interest 
are roundish, it is useful to choose a disk-shaped structuring element for the 
connectivity of the maxima. The choice of its size is crucial for the effectiveness 
of the markers. A small disk locates too many regional maxima and by choosing 
too large a disk relevant image objects could be missed. But we are looking 
for parasite locations, i.e. regional extrema inside the red cells. So, the size of 
the connected components in the image, location of regional maxima, must be 
smaller than the size of the red cells. Therefore, the size of the structuring 
element defining the connectivity of the regional extrema could be chosen as 
equal to the size of the red cells. In order to evaluate the red cell sizes we apply 
a granulometric analysis on the image based on opening operations with 
disk-shaped structuring elements. Figure 2 shows the resulting size distribution for the 
saturation image (note the use of the disk-shaped SE). The histogram indicates 
the presence of two predominant particle sizes in the input image, corresponding to the 
nuclei of the trophozoites (3-8 pixels) and to the red blood cells (20-25 pixels). 
Let us denote by c the greatest size of the red cells estimated from the pattern 
spectrum. In the sample image c is equal to 25. At this point it is possible 
to look for all the regional maxima in the H and S images, according to the 
connectivity of a disk-shaped structuring element with radius c. Let us denote 
by MH and MS the marker images, containing the regional maxima of H and 
S, respectively. The marker image of the nuclei of the parasites, MHS, is then 
the intersection of the two marker images, MH and MS, after dilating both 
images by a disk-shaped structuring element whose radius is equal to the nuclei 
size (equal to 5 in our sample image), i.e.: 

$$ MHS = MH \cap MS. \tag{1} $$
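
The granulometric size estimate used above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on a binary image stored as a set of pixel coordinates, whereas the paper applies grey scale granulometries to the filtered saturation image. The helper names and the two-disk test image are assumptions made for the example.

```python
def disk(radius):
    """Pixel offsets of a digital disk of the given radius."""
    r = range(-radius, radius + 1)
    return [(dx, dy) for dx in r for dy in r if dx * dx + dy * dy <= radius * radius]

def erode(pixels, se):
    return {(x, y) for (x, y) in pixels
            if all((x + dx, y + dy) in pixels for dx, dy in se)}

def dilate(pixels, se):
    return {(x + dx, y + dy) for (x, y) in pixels for dx, dy in se}

def opening(pixels, se):
    # The disk is symmetric, so no reflection of the SE is needed.
    return dilate(erode(pixels, se), se)

def pattern_spectrum(pixels, max_radius):
    """Area lost between the openings of radius r-1 and r, for each r."""
    areas = [len(pixels)]
    areas += [len(opening(pixels, disk(r))) for r in range(1, max_radius + 1)]
    return {r: areas[r - 1] - areas[r] for r in range(1, max_radius + 1)}

def disk_at(cx, cy, radius):
    return {(cx + dx, cy + dy) for dx, dy in disk(radius)}

# Hypothetical image: a small "nucleus" and a large "cell".
image = disk_at(10, 10, 2) | disk_at(40, 40, 6)
spectrum = pattern_spectrum(image, 8)
print(spectrum)
```

An object of radius R disappears at the opening of radius R + 1, so the large disk produces a peak at r = 7 in this sketch; the largest radius carrying a significant peak plays the role of c in the text.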

The MHS image locates the regional maxima in the image, i.e. connected 
components of constant grey level. But the parasite nuclei are not all of constant 
grey level and so the MHS image marks only a subset of the pixels within the 
nuclei. To detect all the parasites we need a grey level characterising the nuclei. 
Let us denote by $\mu_H$ and $\mu_S$ the average grey levels of the nuclei marked by 
MHS, computed on the H and S images, respectively, i.e.: 

$$ \mu_H = \frac{1}{\mathrm{sum}(MHS)} \sum_{p \in MHS} H(p) \tag{2} $$

and 

$$ \mu_S = \frac{1}{\mathrm{sum}(MHS)} \sum_{p \in MHS} S(p) \tag{3} $$

where sum(MHS) is the number of pixels marked as regional maxima in MHS, and 
H(p) and S(p) are the grey levels of the pixel p in the H and S images, 
respectively. The values $\mu_H$ and $\mu_S$ are chosen as threshold values to detect the objects 
of interest in the H and S images. The image THS identifying all the parasites 
and the white blood cells is then the intersection of TH and TS: 

$$ THS = TH \cap TS \tag{4} $$

where TH and TS are obtained by thresholding H and S, respectively, by means 
of $\mu_H$ and $\mu_S$ (Figure 3 shows the THS contours overlaid on the original image). 
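
The threshold selection of this subsection can be sketched in a few lines. The sketch assumes that each component image is a dictionary mapping pixel coordinates to grey levels and that a pixel is kept when its value is greater than or equal to the threshold; both choices, as well as the function names and the toy data, are assumptions made for the example.

```python
def mean_under_marker(image, marker):
    """Average grey level of `image` over the marked pixels."""
    return sum(image[p] for p in marker) / len(marker)

def threshold(image, level):
    """Pixels whose grey level reaches `level` (assumed: >= comparison)."""
    return {p for p, value in image.items() if value >= level}

def detect(h_img, s_img, mhs):
    mu_h = mean_under_marker(h_img, mhs)
    mu_s = mean_under_marker(s_img, mhs)
    return threshold(h_img, mu_h) & threshold(s_img, mu_s)

# Toy 5x1 images: pixel 3 is bright only in S, pixel 4 only in H.
h_img = {(0, 0): 10, (1, 0): 200, (2, 0): 200, (3, 0): 20, (4, 0): 210}
s_img = {(0, 0): 15, (1, 0): 220, (2, 0): 220, (3, 0): 230, (4, 0): 30}
mhs = {(1, 0), (2, 0)}
print(sorted(detect(h_img, s_img, mhs)))  # [(1, 0), (2, 0)]
```

Pixels that are bright in only one of the two components are rejected by the intersection, which is the purpose of combining TH and TS.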

2.2 Detection of White Blood Cells and Schizonts 

The objects present in the thresholded image THS are parasites of all types and 
white blood cells. We can isolate the white cells by means of a morphological 
erosion with a disk-shaped structuring element whose size is obtained from the 
granulometric analysis. The size of white blood cells can be inferred from the 
average dimension of the red cells and hence the size of the structuring element 
can be selected as equal to this dimension, since the chromatin spots are surely 
smaller than a red cell. In our sample case this size is 22. The white cell marked 
by this erosion is then reconstructed by dilation. 

In order to identify the schizonts in the THS image we look at how clustered 
the remaining objects are. This means we need a measure of separation of the 
objects in the image plane, in other words we need to compute the distance 
between sets in a discrete space. The distance between sets can be defined using 
the notion of Hausdorff distance. Two sets are within Hausdorff distance λ 
from each other iff any point of either set is within distance λ from some point of 
the other set. Let us denote by A and B two sets. By a morphological approach 
the Hausdorff distance between these sets is the minimum radius λ of the 
disks S_λ such that A dilated by S_λ contains B and B dilated by S_λ contains A. 
All the objects whose distance is smaller than the average (or maximum) size 
of red cells make up a schizont. All the other remaining objects identify nuclei 
of parasites. Applying a morphological reconstruction by dilation of the mask 
THS image from the marker schizont image by a 3x3-cross structuring element, 
we are able to localise the schizonts in the input image (see Figure 4). 
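
The clustering criterion for schizonts can be sketched with a point-based Hausdorff distance. This is an illustration under stated assumptions: objects are sets of pixel coordinates, the distance is Euclidean and is computed by brute force rather than by dilations, and the grouping routine `cluster` is a hypothetical helper. The limit of 22 is the sample red cell size quoted above.

```python
import math

def directed(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    return max(directed(a, b), directed(b, a))

def cluster(objects, limit):
    """Group objects linked by a Hausdorff distance smaller than `limit`."""
    groups = []
    for obj in objects:
        near = [g for g in groups if any(hausdorff(obj, o) < limit for o in g)]
        rest = [g for g in groups if not any(g is n for n in near)]
        groups = rest + [[obj] + [o for g in near for o in g]]
    return groups

# Three hypothetical chromatin spots: two close together, one far away.
spots = [{(0, 0), (1, 0)}, {(6, 0)}, {(60, 60)}]
groups = cluster(spots, limit=22)
print(sorted(len(g) for g in groups))  # [1, 2]
```

Groups containing more than one spot would be treated as schizont candidates, while isolated spots remain candidate nuclei of trophozoites or gametocytes.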



Fig. 3. The contours of parasites and white blood cell 

Fig. 4. White cell and schizonts detected and depicted in “white” 

Fig. 5. The marker image used in the segmentation 

Fig. 6. The contours of the red cells, single and composite 



At this point all the remaining spots in the THS image identify nuclei of 
trophozoites, immature or mature, and gametocytes. The evaluation of 
parasitaemia also requires counting the red cells and locating the red cells that 
the parasites infect. This requires the segmentation of the red cells. Before segmentation we 
remove from the input image the objects we have already identified (white cells 
and schizonts) so that the image to analyse contains only red cells, some of which 
are infected by parasites. 



2.3 Segmentation of Red Blood Cells 

The aim is to isolate each individual red blood cell, especially when they are 
overlapping and partially occluded and form clusters in the viewing field of the 
microscope, and so to locate and recognize the cell contours. To isolate the red 
blood cells we use the green component image because it is the cleanest. However, 
thresholding the input image, we have observed that cells disappear into the 
background at the sides as noise comes up in the centre. Therefore we correct 
this non-uniform illumination using a paraboloid model of the illumination. 
Thresholding the green image at this stage would leave ’holes’ in the middle 
of the red blood cells. After filling the holes in the red cells by a morphological 
area opening, the last step of this pre-segmentation phase consists in making the 
background flat and clean. So, we threshold the image, setting to zero all the pix- 
els belonging to the background, and remove items smaller than a red blood cell 
from it by means of a morphological area-open filter. At this point segmenting the 
image means retrieving the red cell blocks. In order to identify red cell bodies we 
use the granulometric analysis on the image, already done in the previous phase 
of the processing. The image consists of objects of two main different sizes, cells 
and trophozoites. Some objects are overlapping and they also are too cluttered to 
enable detection of individual particles. Therefore the first step of the segmenta- 
tion process consists in estimating the size distribution to evaluate the smallest 
size s of the red cells and applying a morphological opening by a hemispherical 
structuring element of radius s on the image. In our sample image s is equal to 20. 

The morphological gradient of the result of the opening produces a first 
rough localisation of the red cell contours. We binarize the gradient image and 
close the holes in order to get a binary image we can use as the marker image in the 
classical watershed-based segmentation [2] (see Figure 5) to find the contours of 
the red blood cells (see Figure 6). This gives us a partial segmentation with some 
compound cells still to deal with. In fact the red blood cells can be overlapping 
or partially occluded. Therefore each cell body area detected by segmentation 
can correspond to either an individual or a composite cell. So cell parameters measuring 
the roundness are calculated. If a cell has a large ratio of the major axis to 
the minor axis (greater than or equal to 1.3), i.e. it is elongated, it is treated as a 
composite cell (see Figure 7). The composite cells are then separated into the 
individual contributing cells by applying a morphological opening by a flat 
disk-shaped structuring element of size s to the composite cells. The morphological 
gradient of the result of the opening produces a first rough localisation of the 
individual red cell bodies: the binarization of the gradient image produces rough 
individual contours. After closing the small holes between adjacent contours by 
a morphological area closing, a morphological thinning leaves contours around 
the individual cells (Figure 8 shows the contours of all the individual red cells). 
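
The elongation test for composite cells can be sketched as follows. The text does not state how the major and minor axes are measured; the sketch assumes they are taken from the second-order central moments of the region (the axes of the best-fitting ellipse), which is one common choice. The function names and the synthetic disks are hypothetical; only the 1.3 threshold comes from the text.

```python
import math

def axis_ratio(pixels):
    """Major/minor axis ratio from second-order central moments (assumed measure)."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    mean = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    if mean - spread <= 0:          # degenerate region, e.g. a straight line
        return math.inf
    # Eigenvalues of the covariance matrix are mean +/- spread.
    return math.sqrt((mean + spread) / (mean - spread))

def disk_at(cx, cy, r):
    return {(x, y) for x in range(cx - r, cx + r + 1)
            for y in range(cy - r, cy + r + 1)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r}

single = disk_at(0, 0, 10)                        # one round cell
pair = disk_at(0, 0, 10) | disk_at(14, 0, 10)     # two overlapping cells
for name, cell in (("single", single), ("pair", pair)):
    ratio = axis_ratio(cell)
    print(name, round(ratio, 2), "composite" if ratio >= 1.3 else "individual")
```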

2.4 Identification of the Infected Red Cells and the Parasites’ 
Nuclei 

Finding the trophozoites identifies the infected red cells in the image. Each 
trophozoite, both immature and mature, presents a nucleus we have already 
isolated in the THS image after having removed the white blood cells and the 
schizonts. So, applying a morphological reconstruction by dilation of this mask 
binary image conditioned to the segmented image using a disk-shaped 
structuring element, we are able to identify and isolate the infected red cells in the 
input image. 

2.5 Classification of Parasites 

The last step of the system is to classify the parasites in the other three classes 
of objects, immature trophozoites, mature trophozoites and gametocytes. In the 
paper we present a method still based on a morphological approach. 
Fig. 7. The contours of the composite red cells 

Fig. 8. The contours of the individual red cells 

Fig. 9. The classical watershed-based algorithm segmentation 

Fig. 10. The parasites classification by morphological skeleton 



Each trophozoite, both immature and mature, is characterised by a nucleus, a 
circular spot of chromatin, particularly evident in the hue and saturation com- 
ponents of the input image. A mature trophozoite differs from an immature one 
because of the presence of active pseudopodia, the presence of pigment granules 
around the nucleus that become more numerous in case of a gametocyte. So, the 
classification is solved analysing the shape of the parasite automatically detected 
in the first step of the malarial processing using morphological operators. 

In all pattern recognition problems we need to extract the features of an 
object to classify it. In order to classify the parasites we try to simplify the shape 
as much as possible, so as to make the classification, which is a topological 
analysis, as simple as possible. One possibility consists in creating a version of 
the pattern that is as thin as possible, i.e. thinning the object to a set of 
idealised thin lines which summarise the information of the original object while 
preserving its topology. The resulting thin lines are called the skeleton or medial 
axis of the input pattern and they are the thinnest representation of the original 
pattern that preserves the topology. The detection of the endpoints is important 
for classifying our objects, being strictly related to the shape of a parasite. As we 
have already observed, an immature trophozoite is characterised by a nucleus, a 
circular spot of chromatin, and so the skeleton does not present many endpoints. 




C. Di Ruberto et al. 



The skeleton of a mature trophozoite, in contrast, presents more endpoints, because of 
the presence of pigment granules around the nucleus. The number of endpoints 
increases in the case of gametocytes. Many algorithms that generate digital skeletons 
have been proposed, but most of them produce a non-connected skeleton, which 
is useless for shape description applications since homotopy is not preserved and 
characteristic points such as multiple points or endpoints in the continuous case 
are lost. Digital skeletons can be generated by thinning algorithms. In 0 the 
thinning process has been analysed, including the proof of convergence, the con- 
dition for one-pixel thick skeletons and the connectivity of skeletons. A digital 
set can be skeletonised so as to preserve these important properties by thinning 
the set with SEs preserving homotopy, i.e. homotopic SEs. The skeleton is 
obtained by thinning the input image with a homotopic SE, or a series of homotopic SEs 
and their rotations, until stability has been reached. 

The skeleton of a binary object contains useful information about it and the 
endpoints of the skeleton are an interesting shape descriptor that we have used for 
our pattern recognition purpose. The more endpoints a skeletonised object has, the 
more the object differs from a circular one. All the immature parasites are 
disk-shaped so the number of endpoints is small, while for mature parasites the 
shape is more irregular and rough, departing from the disk shape, and the irregularity 
increases in the gametocytes. We have used this morphological feature to 
recognize the parasites in our sample, and the successful classification is shown in Figure 10. 
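
The endpoint feature can be sketched as follows, assuming the skeleton is already available as a one-pixel-thick set of coordinates and that an endpoint is a skeleton pixel with exactly one 8-connected neighbour. The thinning step itself is not reproduced, and the three test skeletons are hypothetical. The text gives no numerical cut-offs between the classes, so the sketch only counts endpoints.

```python
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def endpoints(skeleton):
    """Skeleton pixels with exactly one 8-connected neighbour in the skeleton."""
    return {(x, y) for (x, y) in skeleton
            if sum((x + dx, y + dy) in skeleton for dx, dy in NEIGHBOURS) == 1}

# Hypothetical skeletons: a point (round spot), a segment, a four-armed star.
dot = {(0, 0)}
line = {(x, 0) for x in range(5)}
star = ({(0, 0)}
        | {(s * k, 0) for s in (-1, 1) for k in (1, 2, 3)}
        | {(0, s * k) for s in (-1, 1) for k in (1, 2, 3)})

print(len(endpoints(dot)), len(endpoints(line)), len(endpoints(star)))  # 0 2 4
```

A round spot yields no endpoints, while each protrusion of an irregular shape adds one, which is the ordering the classification relies on.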

3 Experimental Results 

In Figures 1-10 the different steps of the procedure are presented. Figure 1 shows 
the initial sample image, scanned from a colour photograph of stained malarial 
rodent blood. The image contains all the different kinds of 
parasites, a polymorph white blood cell and red cells. Figure 2 shows the pattern 
spectrum of the filtered and closed saturation component and Figure 3 shows, in 
white, the contours of the parasites and white blood cell marked by the 
regional maxima, using a structuring element with radius 25. In Figure 4 the 
isolated schizonts and white blood cell are depicted in white on the input image. 
Figures 5 to 8 show the main steps of the segmentation phase, and Figure 9 
presents the contours detected by the classical watershed-based algorithm. 
Finally, after having isolated the infected red cells, the classification of the 
extracted parasites is illustrated in Figure 10. The classification is obtained by using 
the morphological skeletons: the label IT is for immature trophozoite, the label 
MT for mature trophozoite, while GA is for gametocyte. All the parasites have 
been correctly identified and classified. 

Fig. 11. The counting results on some malarial images 

In Figure 11 we present the numerical results obtained by processing 12 images 
of malarial blood. Each image has been analysed by two biologists (called A and 
B in the table) and by our method (C). For each image we have counted the 
immature trophozoites (IT), the mature trophozoites (MT), the schizonts (SC), 
the total parasites (PAR), the total red blood cells (RBC), the percentage 
of parasites on red blood cells (% PAR), the percentage of trophozoites that 
are immature (% IT), the percentage of trophozoites that are mature (% MT), 
the percentage of parasites that are schizonts (% SC) and the white blood cells 
(WBC). The counting results of the two biologists differ more from each 
other than our counts do from either of them, i.e. the A-B comparisons are 
worse than both A-C and B-C. As can be observed from the numerical 
results, our automatic method seems to be a good compromise between the 
two experienced users. The experimental results have turned out to be very 
encouraging and we hope to test our method on a larger database of images. 
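
The derived quantities listed above follow from the raw counts by simple ratios, as the sketch below shows. The counts used in the example are invented for illustration and are not taken from Figure 11; the function name is hypothetical, and the sketch assumes non-zero denominators.

```python
def summary(it, mt, sc, par, rbc):
    """Percentages as defined in the text, from raw counts of one image."""
    trophozoites = it + mt
    return {
        "% PAR": 100.0 * par / rbc,           # parasites per red blood cells
        "% IT": 100.0 * it / trophozoites,    # immature share of trophozoites
        "% MT": 100.0 * mt / trophozoites,    # mature share of trophozoites
        "% SC": 100.0 * sc / par,             # schizont share of parasites
    }

# Invented counts, for illustration only.
print(summary(it=6, mt=2, sc=2, par=10, rbc=200))
# {'% PAR': 5.0, '% IT': 75.0, '% MT': 25.0, '% SC': 20.0}
```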



4 Concluding Remarks 

In this paper we have presented a morphological method to analyse malarial ro- 
dent blood images. The aim of malarial blood image processing is to detect the 
parasites infecting the red cells in order to evaluate the number of parasites per 
number of red blood cells. The proposed method identifies automatically the 
parasites using colour and size information, extracted by a morphological ap- 
proach. We have used the regional maxima to detect the nuclei of the parasites, 
according to a connectivity of a disk-shaped structuring element whose radius 
is the greatest size of the red blood cells. The latter is obtained by a granulo- 
metric analysis of a filtered component of the image in the hue-saturation-value 
space. We have also presented a morphological system to segment blood images. 
The granulometric analysis and the opening by non-flat disk-shaped structuring 
element are the main steps of the segmentation process. Granulometry is used 
to capture information about objects of particular size and shape. In our case 
the objects of interest are cells, i.e. bright blobby image parts. We have used 
grey scale granulometries based on opening with disk-shaped elements. We have 
chosen a non-flat (hemispherical) disk-shaped structuring element to enhance the 
roundness and the compactness of the red cells before applying the watershed 
algorithm, as these features could be lost or be too weak to produce an accurate 
segmentation if directly evaluated on the input image. Figure 9 shows the classical 
watershed-based algorithm segmentation applied to the input image of Figure 1. 
In contrast, we have used a flat disk-shaped structuring element to 
detect the points of contact in composite cells. Openings with flat disks capture 
information about the height of overlapping objects, allowing the separation of 
composite structures into the individual composing parts. Finally we have 
presented a method for classification of parasites based on the morphological skeleton, 
using endpoints as features for recognition. The sample images we have presented 
show that the proposed morphological analysis achieves very good and accurate 
performance. At the moment the method is dependent on the exposure and light 
conditions. So our future research will be focused on the automated choice of 
the morphological parameters in different exposure and magnification situations 
and on extending the proposed morphological technique to 
analyse malarial human blood images. 
Abstract. The paper describes a vision processing system for traffic speed 
computation. Vehicle tracking is performed by using high-level features, namely 
clusters of license plate characters, to achieve increased robustness of the 
system. By using a binocular arrangement of the Computer Vision system, the 
spatial and temporal range of image capture between consecutive views is 
extended, which leads to an improved accuracy in speed computation. Multiple 
views can be collected from the same vehicle in transit, depending on its 
speed. The effectiveness of the proposed solution is demonstrated 
through a geometric simulation model, where almost all operating conditions 
and constraints are fully exploited and tested. An error sensitivity analysis is 
carried out to identify the most critical components of the system and the 
possible sources of errors in speed computation. This simulation approach 
proves quite useful in this complex scenario, where it is commonly very 
difficult to collect true speed measures from the vehicles in transit. Besides the 
simulation analysis, a series of experimental results have been collected by a 
first prototype that has been available since the end of year 2000. 



1. Introduction 

Public administrations all over the world are heavily involved in the identification 
of reliable solutions for traffic control. The most required functions are license plate 
reading and speed violation detection, to provide traffic flow and access 
control to restricted areas in downtown or historical centers. Speed enforcement is 
definitely one of the most important applications, since speed limit violation is one 
of the most relevant causes of accidents everywhere. 

Most established solutions currently in use are based on radar or laser technologies 
with an increasing interest in using optical systems [1]. In the last few years there has 
been an increasing exploitation of Computer Vision technology for vehicle tracking 
and speed measurement, due to the recent improvements in the algorithms and the 
availability of more processing power. Some results have moved from academic 
research labs to applied solutions [2]. Interesting solutions have been recently 
proposed [3] to compute the travel time of vehicles and the network level origin- 
destination matrix based on Computer Vision. It is based on the use of separate 
License Plate Readers placed along a road at a suitable distance (from a hundred meters 
up to a few kilometers apart). The identification of the same license plate in the two 
sites allows the system to compute the average speed of the vehicle in this space 
interval with high precision. The resulting measure can be successfully used both for 
monitoring purposes (to provide information about the average transit time along 
selected routes) as well as for enforcement (to detect speed violation). 

Our proposed approach aims to compute the instantaneous speed of the 
vehicles by using a binocular system placed on the side of the road, to frame the 
image of the vehicle at two consecutive times, very close together. By tracking the 
vehicle in the two images it is possible to get a quite accurate estimation of the 
vehicle speed. Actually, at lower speed, a number of views can be collected from each 
individual acquisition camera. The features used for matching and tracking are high 
level features (clusters of matched license plate characters) to minimize matching 
errors. An error sensitivity analysis is also reported, using a simulation model of 
the binocular configuration. In this way it is possible to identify potential sources of 
errors and predict the behavior of the system in the different operating conditions. 

The first section of the paper describes the proposed approach of speed estimation 
with a binocular vision system. Geometric constraints are discussed, together with the main 
mathematical relations used for the speed measurement. The second section briefly recalls the 
License Plate Recognition system based on Elsag O2CR reading technology, which is used 
for high-level feature detection and tracking. The following error sensitivity analysis 
makes it possible to point out potential sources of errors, to highlight weak and strong 
properties of the processing chain, and to assess the consequences of perspective and 
calibration constraints. 



2. Speed Estimation by a Binocular Vision System 

The proposed approach is based on the computation of speed from the geometric 
features of License Plates acquired by a binocular configuration as depicted in Fig. 1. 
The License Plate LP is supposed to be framed by the first sensor S1 with a time 
stamp t1 = n dt, and by the second sensor S2 at time t2 = m dt, where the time sampling 
dt is the same (i.e. 20 msec, at 50 Hz) since both cameras are synchronized. 

Fig. 1 shows the different parameters involved in the process, as will be explained 
later in the text. As shown in the figure, all the analysis is carried out in a 
reference trajectory plane which is supposed to contain both the optical axes of the 
two cameras and the trajectory of the license plates. This model does not take into 
account the different heights of the license plates of the vehicles. Such approximation 
has been proved to have negligible consequences on the measurement process. The 
common reference system is centered in the acquisition unit, in between the two 
imaging sensors. In general the orientation of the license plates may be unpredictable 
but it is quite likely to be orthogonal to the current trajectory, and this can be used as a 
system constraint. Landmarks L1, L2, L3, the nominal trajectory TN, and the distance 
dbs of the acquisition system from the road side are calibration parameters and will 
be discussed later. Fig. 2 summarizes the binocular speed measuring process. A 
vehicle transit event TE(Lp1, Lp2) consists in the matching of two consecutive views 
(Lp1, Lp2) of the same license plate, along the current trajectory. Of course there could 
be many instances of such correspondence, depending on the actual speed of the 
vehicle and the efficiency of the License Plate reader. 
Fig. 1. Geometric configuration of the binocular system installed on the road side. The 
acquisition system is identified by the two camera units CO1 and CO2. 

Once a match has been found, the essential imaging features are extracted from the 
cluster of recognized characters, namely the centers of mass of the clusters, 
b1 = (ub1, vb1), b2 = (ub2, vb2), and the average sizes of the characters, sz1 = (dh1, dl1) and 
sz2 = (dh2, dl2). The next step is the computation of the position of the license plates 
Lp1, Lp2 in the reference plane, i.e. the vectors R1* and R2* as shown in Fig. 1. 

The centers of mass (b1, b2) are the projections of the license plate onto the 
imaging sensors and provide information about the direction of the vectors R1*, R2*. 
The modulus of the vector (i.e. the distance of the license plate from the sensor) is 
obtained from the average vertical size of the characters (dh), as: 

|R*| [mm] = f [pixels] * HT [mm] / dh [pixels] (1) 

where HT is the height (in mm) of the actual character, f is the focal length of the 
imaging sensor, in pixel units, to take into account intrinsic parameters of the sensor, 
and dh is the average height of the matched characters (which can be computed with 
subpixel precision). 
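
Relation (1) can be checked numerically with a short sketch. The focal length, character height and measured character size used below are hypothetical values chosen only to make the arithmetic easy to follow; they are not parameters of the prototype.

```python
def plate_range_mm(f_pixels, char_height_mm, dh_pixels):
    """Distance of the plate from the sensor, relation (1): f * HT / dh."""
    return f_pixels * char_height_mm / dh_pixels

# Hypothetical values: f = 4000 pixels, HT = 80 mm, dh = 16 pixels.
nominal = plate_range_mm(4000, 80, 16)
shifted = plate_range_mm(4000, 80, 16.1)   # same plate, dh off by 0.1 pixel
print(nominal)                              # 20000.0
print(round(nominal - shifted, 1))          # 124.2
```

With these values a 0.1 pixel error in the character height moves the estimated range by about 12 cm at 20 m, which suggests why the subpixel precision of dh matters for the later speed computation.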

When the two vectors $R_1^*$, $R_2^*$ from the two matched views $(Lp_1, Lp_2)$ of the same 
license plate (at time stamps $t_1 = n\,dt$, $t_2 = m\,dt$) have been computed, the speed 
estimation is quite straightforward: 

$$v_{est} = \frac{ds}{(m - n)\,dt} \cdot \frac{1}{\cos\delta} \qquad (2)$$

where $\operatorname{tg}\delta = ns / ds$, 

$$ns = |R_2^*| \sin(\varphi_2 - \beta_{2est}) - |R_1^*| \sin(\varphi_1 - \beta_{1est}) + BT \cos(\alpha)$$

$$ds = |R_2^*| \cos(\varphi_2 - \beta_{2est}) - |R_1^*| \cos(\varphi_1 - \beta_{1est}) + BT \sin(\alpha)$$

and $\alpha = (\varphi_1 + \varphi_2)/2$. 

$\varphi_1$, $\varphi_2$ are the angles of the two optical axes of the sensors w.r.t. the nominal 
trajectory, BT is the telemetric basis of the binocular system, and $\beta_{1est}$, $\beta_{2est}$ 
are the angles of orientation of the vectors $R_1^*$, $R_2^*$ w.r.t. the optical axes of the 
sensors (in the reference plane), obtained from the centers of mass $b_1$, $b_2$ as: 

$$\operatorname{tg}\beta_{1est} = ub_1 / f_1 \quad (f_1 \text{ in pixels})$$

$$\operatorname{tg}\beta_{2est} = ub_2 / f_2 \quad (f_2 \text{ in pixels})$$
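
The last step of eq. (2) can be sketched in Python as follows. The sketch assumes that the displacement components ns (normal to the nominal trajectory) and ds (along it) have already been computed in mm; the function name, the units and the sample values are assumptions made for the illustration.

```python
import math


def speed_from_displacement(ns_mm, ds_mm, n, m, dt_s):
    """Return (speed in km/h, delta in degrees) following eq. (2).

    ns_mm, ds_mm: displacement components between the two readings, in mm.
    n, m: field indices of the two readings; dt_s: field period in seconds.
    """
    if m == n:
        raise ValueError("the two readings must come from different fields")
    if ds_mm == 0:
        raise ValueError("ds must be non-zero to define tg(delta) = ns / ds")
    delta = math.atan(ns_mm / ds_mm)                      # tg(delta) = ns / ds
    speed_mm_s = (ds_mm / ((m - n) * dt_s)) / math.cos(delta)
    return speed_mm_s * 3.6e-3, math.degrees(delta)      # 1 mm/s = 0.0036 km/h


if __name__ == "__main__":
    # Hypothetical reading pair: 2 m along the lane in 3 fields at 50 fields/sec.
    print(speed_from_displacement(0.0, 2000.0, 0, 3, 0.02))
```

Dividing by cos(delta) makes the result equal to the length of the full displacement vector divided by the elapsed time, so a vehicle that drifts sideways is not under-estimated.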




Fig. 2. Flow diagram of the binocular speed measurement process 

In the above formula (3) the angle $\delta$ is an important cue on the reliability and 
consistency of the measure. Moreover, a bias of $\delta$ in many consecutive speed 
measures may be a serious hint of an error in the installation of the sensor, as briefly 
discussed in the following. 

3. Experimental Set-Up of the Binocular System 

A prototype version of a binocular vision system for license plate reading from the 
road side has been integrated and tested. Fig. 3 shows the transportable system in a 
recent experimental test along the highway. It consists of three components: a 
binocular acquisition unit, a processing box and a battery to provide the required 
autonomy in the different applications. 




Fig. 3. Experimental binocular system 



Fig. 4. Landmark L3 installation procedure 



The selected imaging units are high-sensitivity cameras. For each specific 
application it is possible to select the most suitable optical parameters (from 12 to 75 
mm focal length) and to control the convergence/divergence of the optical axes for each 
individual camera during the calibration phase. Fig. 3 shows the acquisition head, 
with the two cameras combined with an IR illumination source (based on halogen 
lamps) to achieve maximum efficiency when operating at long distance (over 24 meters). A 
TFT-LCD interface unit has been integrated in the rear part of the acquisition unit, to 
optimize human-machine interaction, simplify installation procedures (fig. 4) and 
provide visual feedback on the behavior of the system. 

Actually the system has been developed primarily for road side License Plate 
reading in security applications where the transportable unit from the field can be 
connected to an operative supervisory unit for data collection and decision making. 
The proposed application for speed measurement is one of the additional features of 
the system. Other potential applications are data collection on the traffic flow within 
urban areas or along selected ways. 



3.1 Calibration and Installation Procedures 

A very simple and effective calibration procedure has been developed in the lab 
environment, to fix the acquisition geometry according to the selected operating 
conditions (i.e. the distance from the road side). 

Two license plates are placed along the nominal trajectory in positions L1 and L2 
(as in fig. 1) to be acquired by the two cameras, along the optical axes, with a nominal 
scale factor (image resolution) corresponding to the selected focal lengths at the 
predicted distances $(R_1, R_2)$. 

A third vertical landmark L3 is placed at a pre-selected distance (as depicted in 
fig. 1), to control the correct orientation (pan angle) of the whole acquisition system 
w.r.t. the nominal trajectory. The vertical landmark L3 must be acquired by the sensor 
C02 at a predefined column of the image. When the system is mounted in the field it 
is sufficient to verify this correspondence on the image plane to accept the 
installation as correct, as shown in fig. 4. It is also possible to build a table of 
correspondences between the position of the landmark L3 and the distance from the 
side of the road (dsb in fig. 1). Two basic assumptions should be noted: the road has 
no slope, and the acquisition head is placed at almost the same height as the 
license plates to be read (from 50 to 90 cm above the ground). 



4. License Plate Recognition 

The key feature of the process is the availability of an extremely effective License 
Plate Recognition system able to detect the maximum number of instances of the LPs 
in transit with high accuracy. The LPR system should be able to work continuously on 
the video sequence at the maximum rate (50 Hz for standard CCIR video) to pick up the 
license plates from the video flow, without the help of external triggering devices. 
Extremely accurate character segmentation and localization is required, 
since any error at this level would propagate to the range estimate R* and to the 
subsequent speed measure in equation (3). 

Elsag has been developing LPR (License Plate Reading) systems since the 
beginning of the 1990s for a variety of applications [4], from toll gate control 
(Telepass) along the highways to parking access control, urban traffic monitoring and 
road pricing. This technology is the result of long experience at Elsag 
in Intelligent Character Recognition applied to document and form 
processing [5] as well as to address understanding in mail sorting, including cursive 
handwriting [6]. A wide variety of solutions have been tried, using both 
statistical and neural network approaches. The adopted solution is based on a fairly 
classical pattern matching approach, and it represents a satisfactory compromise 
between accuracy and computational efficiency. 

The task of license plate recognition involves many system-engineering issues and 
multidisciplinary competence in electro-optics and computer vision. Infrared lighting 
is widely used, especially at night, to minimize external uncontrolled conditions [7]. 
The position of the sensor and its orientation are other essential issues, to minimize 
perspective deformations of the characters and achieve the necessary resolution in the 
scene. In general, whatever kind of camera is used, it should be placed so as to avoid 
occlusion of the license plate and to acquire the plate field as 
squarely as possible. Angled views of up to 30°, in the horizontal or vertical plane, are 
usually manageable. It is good practice to ensure that the license-plate characters 
appear as near vertical as possible in the digitized image. 

The following section summarizes the real-time video processing approach which 
is used by the Elsag proprietary O2CR/LPR solution. 



4.1 The Plate Recognition Process 

The recognition process consists of three main blocks [8]. A first pre-processing block 
screens the video sequence in order to select images or regions of 
interest which are affected by motion. 

An image processing engine performs license plate location, character 
segmentation, OCR and context analysis and validation. 

A temporal post-processing block follows, to perform data fusion and tracking 
among different images of the video sequence. The context verification process 
exploits both spatial and syntactic information in order to select the best hypothesis 
for the numberplate. 

The final temporal post-processing stage aims to extract a single numberplate for a 
given vehicle. The presence of a vehicle is detected by the observation of a group of 
video frames affected by image motion and showing a sufficient number of 
recognized characters. Such definition allows the detection of a vehicle even if the 
contextual process does not extract any numberplate hypothesis. 

All the numberplate hypotheses, with both spatial and syntactic information, are 
gathered from the images belonging to such a group, and a clustering approach is used in 
order to evaluate the best choice. The main characteristics of the O2CR LPR system are 
briefly summarized as follows [8]: 

• Adaptive control of the exposure time of the camera to optimize the contrast 
image of the license plate 

• Data temporal integration and feedback process to recover character 
segmentation errors. 

• Continuous processing on the video sequence (free-running) and simultaneous 
multiple license reading, without the need of external triggering systems. 

• Standard PC processing environment. 

• Integration of high sensitivity imaging sensors with IR-LED integrated 
illumination system. 

• Recognition performance (certified according to the Italian standard UNI 10772): 

    • Between 90 and 95% according to operating conditions, with reading tolerance 
      over 30° 

    • Over 97% in favourable conditions 

• Processing speed at video frequency (50 fields/sec) to detect high speed transit. 

• Multinational license plate (multifont and training to deal with various contextual 
information) 
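
The temporal post-processing stage described above gathers several numberplate hypotheses for one vehicle and keeps the best one. The Python sketch below uses score-weighted voting as a much simpler stand-in for the clustering approach; it is not the O2CR method, and the function name, the score values and the sample strings are hypothetical.

```python
from collections import defaultdict


def fuse_plate_hypotheses(hypotheses):
    """Pick one numberplate string from per-frame (text, score) hypotheses.

    Stand-in technique: the scores of identical strings are summed and the
    string with the largest total wins (ties broken alphabetically).
    Returns None when no hypothesis is available.
    """
    totals = defaultdict(float)
    for text, score in hypotheses:
        totals[text] += score
    if not totals:
        return None
    return max(totals.items(), key=lambda item: (item[1], item[0]))[0]


if __name__ == "__main__":
    # Hypothetical readings of the same vehicle in three video fields.
    frames = [("AB123CD", 0.9), ("AB128CD", 0.6), ("AB123CD", 0.8)]
    print(fuse_plate_hypotheses(frames))
```

A real system would also use the spatial information (position and size of the characters) that the text mentions; the sketch only shows how repeated readings can outvote an isolated misreading.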



5. Simulation Results (Error Sensitivity Analysis) 

To predict the performance and the precision which can be reached by the 
system, a simulation model has been implemented, which tries to generate all the 
situations that may appear in the field. Following the block diagram of fig. 1, a 
first input section has been arranged. It allows the selection of a wide range of operating 
conditions, including the optical parameters of the imaging sensors, the supposed 
trajectory (near to or far from the road side), the speed of the supposed 
vehicle, relative orientations, etc. It is also possible to take into account the 
orientation of the plate in the vertical plane, to verify its consequences on the speed 
estimation process. A second module has been implemented to simulate the image 
processing section, generating the same information as will be provided 
by the License Plate reader, i.e. position and size of the recognized characters. The 
final section implements the actual speed measurement process. 

The main limitation of this approach is the use of a purely geometric simulation 
model. As such, all simulation results assume an ideally 
fast imaging acquisition system with an indefinitely short integration time and with an 
optimal intensity response. On the other hand, separate testing of the image 
acquisition and processing system has shown that the required performance can be 
successfully achieved in the real situation. 

The primary source of errors has proved to be the precision of the 
image feature localization, and in particular the size of the matched characters in the 
two image planes. 
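
To illustrate why the character size dominates the error budget, the following Python sketch truncates a character height to steps of 1/Np pixel and reports the resulting relative error in the range of eq. (1). The truncation rule is an assumption suggested by the character heights listed in Table 1, and the function names are hypothetical; the sketch is not the simulation model used by the authors.

```python
import math


def quantize_height(dh_pixels, np_steps):
    """Truncate a character height to a grid of 1/Np pixel.

    Assumption: truncation rather than rounding. The small epsilon only
    guards against floating-point representation error.
    """
    return math.floor(dh_pixels * np_steps + 1e-9) / np_steps


def relative_range_error(dh_true, np_steps):
    """Relative error of |R*| = f * HT / dh caused by quantizing dh."""
    dh_quantized = quantize_height(dh_true, np_steps)
    return dh_true / dh_quantized - 1.0


if __name__ == "__main__":
    for np_steps in (1, 10, 100):
        dh_q = quantize_height(13.87, np_steps)
        err = relative_range_error(13.87, np_steps)
        print(f"Np={np_steps:3d}  dh={dh_q}  range error={err:+.3%}")
```

Since the range is proportional to 1/dh, truncating dh always over-estimates the distance, and the error shrinks roughly in proportion to 1/Np.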

The following table reports the computed errors in the speed measures, comparing 
the results obtained with an integer quantization (Np=1) of the character heights 
$(dh_1, dh_2)$, a subpixel precision of 10% (Np=10), and a precision of 1% (Np=100). The 
other main simulation parameters are: 

□ Focal lengths: $f_1$ = 25 mm, $f_2$ = 50 mm 

□ Dbs = 2 m, nominal trajectory TN = 5 m 

□ Displacement error of landmark L3 = 47 cm 

□ Vertical slant of the license plate = 5° 

□ Trajectory deviation $\theta$ = 2° 



Table 1. Speed measurement results (err %), as a function of character size precision (Np) for a 
reference speed of 120 km/h 

Np               1          10         100
v-est [km/h]     133,12     121,01     120,51
err %            10,93%     0,845%     0,424%
dh1 [pixels]     13         13,2       13,28
dh2 [pixels]     13         13,8       13,87


Since the size of the characters is by definition an average measure over a 
number of characters (typically 7), Np=10 precision is always achieved 
experimentally. Moreover, improved results can be achieved by time integration (averaging 
multiple estimates from the possible matching pairs of the same license plate on both 
image sensors). The following table compares results (with Np=10 
precision) for different speed values: v-est is the computed speed value, err % is 
the computed error, $(dh_1, dh_2)$ are the computed vertical sizes of the characters, and 
$(Nr_1, Nr_2)$ are the maximum numbers of readings in the two image planes C01 and C02 
respectively. 

Table 2. Simulation results at different speed values 

Actual speed [km/h]   75        95        120       150       200
v-est [km/h]          74,85     94,84     120,64    149,56    201,39
err %                 -0,19%    -0,168%   0,533%    -0,296%   0,694%
delta-est [°]         2,30      2,31      2,42      2,29      2,45
ht1                   14,20     14,20     14,20     14,20     14,20
ht2                   13        12,9      12,9      12,8      13
Nreading1             6         5         4         3         2
Nreading2             6         5         4         3         2


In this case an installation error of the landmark L3 (-30 cm) has been simulated, 
but it has introduced no relevant errors in the measures. The actual trajectory has 
been supposed to have a slant direction of 3° to the right with respect to the lane 
direction. The speed measurement error is always below 0,7% and is nearly constant, 
irrespective of the speed. As anticipated, this is mainly due to the fact 
that the simulation results do not take into account possible radiometric problems of 
the acquisition system. An estimate of the slant orientation is computed (delta-est) and is 
reasonably close to the actual simulated slant of 3°. The number of 
readings (assuming operation at the maximum frequency of 50 fields/sec) is 
obviously higher at lower speed values. Even if in real situations not all such 
readings will be usable for speed measurement computation, there is a significant 
possible improvement in precision by averaging multiple estimates. 

Moreover, the simulation model allows the configuration to be checked. For 
instance, it shows that the system is able to detect possible errors in the 
displacement of the installation landmark L3, and to provide the information needed 
to correct them. It is also possible to see the effects of other critical situations, 
such as the slant of the license plate in the vertical plane or the direction of the 
trajectory ($\theta$). 



6. Experimental Results 

The prototype speed measurement system (as in fig. 3) has been evaluated in two 
different conditions. A first analysis has been carried out in a fully controlled 
environment in the lab, to map the critical function R*(dh) of eq. 1. The results 
obtained are within the required range of precision (Np=10). The second test has 
been performed in the field to prove the detection capability of the binocular imaging 
system when dealing with a wide range of speeds, from slow trucks up to very high 
speed cars. A series of experiments has been performed along a 2-lane highway to 
read the license plates of the vehicles in transit. To provide a ground truth for the 
computed speed, a commercial laser-based speed measurement system has been used. 

The results obtained (although still limited in number) have been satisfactory for 
measured speeds above 180 km/h. Further experiments are planned in the near future 
and will be extensively reported and discussed at the conference. 



7. Conclusions 

The paper reports a proposed approach for speed measurement based on a binocular 
configuration which has already been used successfully for license plate 
reading of vehicles at high speed along the highways. The proposed approach has 
some interesting and promising features: 

> Joint speed and LP reading to support vehicle identification with a unique 
measure 

> High precision, due to a possible time integration and average of multiple 
readings, virtually limited only by shutter speed and light intensity 

> Potential of increased performance with increased resolution of visual sensors 

> Wide measurement range (more than a single lane) even with conventional 
resolution digital cameras 

> Self-awareness to control the installation conditions, to discard noisy ambiguous 
data 

> Many possible configurations (short and long distance) depending on the range of 
maximum speed detectable 

> Possibility to use partial readings of the license plate in order to perform license 
plate matching and speed measurement 

The reported results are also supported by a simulation model which provides 
good control of the main parameters of the system. The next steps of the development 
will involve a thorough investigation and experimentation of the system with different 
sensor configurations (narrow IR and high speed sensors). Moreover, the system will be 
evaluated against certification requirements to obtain a qualified speed measurement 
system for speed enforcement purposes. 



Volume and Surface Area Distributions of Cracks in Concrete 

Abstract. Volumetric images of small mortar samples under load are 
acquired by X-ray microtomography. The images are binarized at many 
different threshold values, and over a million connected components are 
extracted at each threshold with a new, space and time efficient program. 
The rapid increase in the volume and surface area of the foreground 
components (cracks and air holes) is explained in terms of a simple model 
of digitization. Analysis of the data indicates that the foreground consists 
of thin, convoluted manifolds with a complex network topology, and that 
the crack surface area, whose increase with strain must correspond to the 
external work, is higher than expected. 



1 Objectives and Scope of the Paper 

Many attempts to model or recognize shape and form are based on a bi-level 
representation of relatively simple objects. In contrast, we are faced with an 
engineering problem characterized by sequences of large, complex, volumetric 
gray-scale images. This data was produced by a unique imaging instrument 
designed for observing the internal structure of dense, heterogeneous materials. 
The resulting measurements will ultimately be used in multiscale modeling of 
the microstructure for improved understanding of the macroscopic mechanical 
properties of concrete. So far we can report only some observations from 
which we attempt to deduce aggregate and individual shape properties of a large 
collection of objects and to separate material properties from image processing 
artifacts. Our work raises far more questions than it answers. We present it 
here in the hope of gaining assistance from the segment of the image processing 
community dedicated to allied pursuits. 

More specifically, we propose to analyze thin, warped, interconnected 
volumetric entities in sequences of density images of samples of mortar. The data is 
obtained by high-resolution 3-D microtomographic imaging using an X-ray 
imager at the National Synchrotron Light Source at Brookhaven. The images 
show crack formation in mortar and concrete under increasing strain. Other 
applications with similar filiform and quasi-manifold configurations are membranes, 
plant and animal vasculature, nerve fibers, and polymers, all of which can be 
imaged through soft-tissue tomography, magnetic resonance imaging, 3-D 
ultrasound, or confocal microscopy. The topological complexity of such data precludes 
2-D analysis. 

The detection, quantification and further analysis of structural changes in 
cement-based materials offers the potential for a more rational approach to the 
design, testing, repair, or replacement of concrete structures. The total 
replacement value of concrete structures in the US has been estimated to be over six 
trillion dollars. While continuum approaches based on plasticity and linear 
elastic fracture mechanics have led to considerable success in predicting failure in 
fine-grained materials such as metals, non-linear effects have resisted analysis 
in heterogeneous and quasi-brittle materials such as concrete. The study of 
microstructure coupled with traditional stress-strain measurements offers the most 
promising approach. In concrete, cracks are thought to originate from 
one or more porous voids, and they may even spread preferentially through 
voids and pre-existing cracks. We hope that the detailed mechanisms of crack 
origination and propagation may be revealed by 3-D X-ray imaging. 



2 Data Collection 

Microtomography yields a 3-D map of absorptivity from hundreds of 
through-transmission radiographs of the specimen taken from different angles. It is similar 
to medical CAT scans, except for much higher beam intensity and detector 
resolution. The specimen is rotated on a stage designed to allow the application 
of a load while minimizing X-ray absorption (Figure 1). 



[Figure 1 labels: specimen under load (compression); aluminum reaction rods (tension); 
rotation stage; microscope objective] 



Fig. 1. (left) The system for x-ray microtomography; (right) the load cell for holding 
the composite material specimens under calibrated loads. 



The specimens are small mortar cylinders under axial compressive load. The 
stress is continuously monitored by a conventional load cell, and the 
platen-to-platen displacement by a linear-voltage displacement microprobe. The data is 
collected and preprocessed by the University of Maine team at Beamline X2B. 
The X-ray source is synchrotron radiation with a highly collimated narrow-band 
beam monochromated to 32 keV. The detector is a phosphor plate from which 
light is captured by a high-resolution CCD camera. The specimens are exposed to 
the beam at 720 different angles over a 180 degree range. Each exposure lasts 8-12 
seconds, depending on the synchrotron beam current. This combination results 
in a very high resolution (2-6 µm) 3-D array, but the capture cross section is 
limited to about 6 mm by the beam width. 




Fig. 2. Two 2-D circular slices of concrete, gray-scale, one each from successive loads, 
at corresponding heights. Enlargement of 100x100 squares across the large crack. 



The resulting data (Figure 2) consists of sequences of 3-D integer arrays 
representing localized X-ray absorption. The dimensions of each array, 
reconstructed by the EXXON Direct Fourier Reconstruction algorithm on site, are 
typically 1024x1024x800 voxels. A new array is generated from the specimen 
after each of 5 or 6 load-and-release cycles with progressively greater loads. The 
last image is intended to capture the state of the specimen after the peak of the 
stress-strain curve, when any further load would cause it to crumble. Each 
sample requires about 6 hours of "beam-time". The images are originally recorded 
as 32-bit floating-point numbers, and then scaled to eight-bit integers. In our 
representation, high gray values (shown as white or light gray) indicate high 
X-ray absorption, and low values (shown as dark gray or black) indicate voids 
consisting of cracks and air holes. When the data is binarized to 0-1, a higher 
threshold increases the number of foreground (black, i.e., 0) voxels. 

3 Methods and Observations 

The surroundings of the sample concrete cylinder are transparent to X-rays 
just as are the voids inside the cylinder. However, many cracks reach the 
external surface. In order to apply connected-components analysis, it is necessary 
to separate the exterior volume from the crack volume. This is accomplished 
by "shrink-wrapping" the cylinder. The resulting cross-sections are neither truly 
circular nor convex, and change along the axis. 

Our 3-D processing relies heavily on connected components (CC) analysis 
[14,15,16]. We have developed and tested a robust algorithm that is space- 
and time-efficient because it is implemented as a Find-Union on 1-D runs, 
with path compression. In a test array of size 800x800x765 (489,600,000 
voxels), it finds six million six-connected components in 200 seconds (400 
MHz Pentium with 640 Mbytes of RAM). In addition to listing all of the 
connected components with their constituent voxels, the program reports 
the volume, surface area (number of free faces) and the number of 
foreground runs in each CC. The code and test cases are freely available on 
http://www.ecse.rpi.edu/Homepages/wrf/research/connect/. 
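
The idea of a union-find over 1-D runs with path compression can be sketched in a few lines of Python. This is only an illustration of the technique under simplifying assumptions (the volume is a nested list indexed as volume[z][y][x], truthy values mark foreground, and overlaps are tested by brute force); it is not the authors' program, and all function names are hypothetical.

```python
def find(parent, i):
    """Return the root of run i, compressing the path on the way."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def union(parent, a, b):
    """Merge the sets containing runs a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def row_runs(row):
    """Inclusive (start, end) x-ranges of consecutive foreground voxels."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def component_volumes(volume):
    """Volumes (voxel counts) of the six-connected components, largest first."""
    runs, index = [], {}
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            ids = []
            for xs, xe in row_runs(row):
                ids.append(len(runs))
                runs.append((z, y, xs, xe))
            index[(z, y)] = ids
    parent = list(range(len(runs)))
    for rid, (z, y, xs, xe) in enumerate(runs):
        # Only the previous row and the previous plane need to be examined.
        for key in ((z, y - 1), (z - 1, y)):
            for other in index.get(key, ()):
                oxs, oxe = runs[other][2], runs[other][3]
                if oxs <= xe and xs <= oxe:  # x-ranges overlap: shared face
                    union(parent, rid, other)
    volumes = {}
    for rid, (_, _, xs, xe) in enumerate(runs):
        root = find(parent, rid)
        volumes[root] = volumes.get(root, 0) + (xe - xs + 1)
    return sorted(volumes.values(), reverse=True)
```

Working on runs instead of single voxels keeps the union-find array small, since the number of runs is usually far lower than the number of foreground voxels.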

We use VTK, the 3D Visualization Toolkit, to visualize the cracks. VTK 
is an open source, surface-based rendering software system from kitware.com. 
Its rendering support is based on triangulating the gray-scale isosurfaces using 
Marching Cubes [17,18]. Other routines were developed to analyze the volume 
distribution, free surface histograms and merge graph of the connected 
components. 

The remainder of the paper presents the observations in detail and attempts 
to explain them in terms of the characteristics of the sample and of a simple 
model of the digitization process. 



3.1 Effects of Amplitude Quantization 

The radiographic quality of the image data is quite consistent. There is little 
variation in grayscale from sample to sample, because fluctuation in the elec- 
tron beam intensity is compensated for by periodic recording of a blank picture 
(without the sample). The gray scale does not capture the full dynamic range. 
The high absorption regions of mortar have a uniform value of 255 and some 
cracks and voids are saturated at 0. Nevertheless, there is sufficient contrast to 
discriminate the structure. 




Fig. 3. Number of CCs, foreground (empty) volume, average CC volume, and surface 
area against binarization threshold for the whole shrink-wrapped volume of a sample. 



The aggregate (larger particles) is even more opaque to X-rays than the 
mortar, while air is transparent. Given the high-contrast nature of the object 
under study, any threshold in a wide range should be equally satisfactory for 
isolating the voids (cracks and air holes). It turns out, however, that the choice 
of threshold has a very significant effect on the characteristics of the binarized 
image. 

According to Figure 3, the total foreground volume increases gradually from 
4% of the sample to 24%. The apparent crack volume changes by a factor of 
more than three over the operational range of thresholds from 40 to 
60. This is a much larger change than that due to increased load at a constant 
threshold. It can be explained by the model of digitization presented below. At 
the same time, the number of CCs rises from nearly 1.8 million to nearly 5.5 
million, then decreases to 1.8 million again. We conjecture (see below) that at 
the lower thresholds only thick voids are revealed. The eventual drop is expected: 
if the threshold is raised above the value of all voxels, then the entire sample 
will consist of a single connected component. Figure 4 shows representative 2-D 
cross sections at two thresholds at successive loads. 








Fig. 4. 2-D bilevel X-section at thresholds of 40 and 60 at successive loads. 



Because the work performed by the external force is expected to equal the 
work required to stretch the internal surfaces, the crack surface area is an 
important parameter. Furthermore, the ratio A/V of surface area to volume (akin 
to perimeter/area in 2-D) is a useful measure of rotundity that may separate 
cracks from air holes. The ratio $A^{3/2}/V$ is a shape invariant. 




Fig. 5. Surface area to volume ratio of the largest CC in a 200x200x200 
voxel block. 

Fig. 6. Free surface distribution. 



The Area/Volume ratio of the largest CC in a 200x200x200 block (Figure 5) 
falls as expected to a threshold of 40, then rises as even thinner and more 
tortuous cracks are merged to it. The maximum Area/Volume ratio is only 2.5. 

In compact objects, most of the voxels would be either interior voxels, with 
no free face, or surface voxels with exactly one free face. The number of free faces 
is shown in Figure 6 for a 20x20x20 cube and a one-pixel thick 90x90 slab. In 
concrete, there are many voxels with several free surfaces, as seen in Figure 7, 
indicating highly irregular, tortuous surfaces. The skew increases with threshold, 
because although the larger CCs have more interior voxels, we are adding many 
smaller cracks. 
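
The free-face count for the two reference shapes mentioned above can be reproduced with a short Python sketch. It represents a component as a set of integer voxel coordinates, which is an assumption made for brevity; the function names are hypothetical.

```python
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def box(nx, ny, nz):
    """Set of voxel coordinates of a solid nx x ny x nz block."""
    return {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)}


def free_face_histogram(voxels):
    """hist[k] = number of voxels that have exactly k free faces (k = 0..6)."""
    hist = [0] * 7
    for x, y, z in voxels:
        free = sum((x + dx, y + dy, z + dz) not in voxels for dx, dy, dz in NEIGHBORS)
        hist[free] += 1
    return hist


def surface_area(hist):
    """Surface area as the total number of free faces."""
    return sum(k * count for k, count in enumerate(hist))


if __name__ == "__main__":
    for name, shape in (("20x20x20 cube", box(20, 20, 20)), ("90x90 slab", box(90, 90, 1))):
        hist = free_face_histogram(shape)
        area, volume = surface_area(hist), len(shape)
        print(name, hist, "A =", area, "V =", volume, "A/V =", round(area / volume, 3))
```

In the cube almost three quarters of the voxels have no free face, while every voxel of the slab has at least two, which is the contrast between compact and thin shapes that the text describes.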

The logarithmic scatter plots of Figure 7 show that the CCs range from flat 
or filiform (upper envelope: $A/V$ constant) to filled-out shapes (lower envelope: 
$A^{3/2}/V$ constant). The larger the components, the thinner they are. 





Fig. 7 . (left) Bar chart of distribution of free faces at two thresholds; (right) scatter 
plot of surface area vs. CC volume at two thresholds. 



3.2 Effects of Spatial Quantization 

Because of the presence of so much boundary surface (between foreground and 
background), reducing the resolution by subsampling the data by a factor of 8 has a very 
different effect from averaging it over 2x2x2 volumes. Figure 8 compares 
histograms resulting from subsampling and interpolation. This result indicates that 
we can expect radically different results as the resolution of the imager is 
enhanced. 
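
The difference between the two ways of reducing resolution can be seen on a toy example. The Python sketch below is an illustration under stated assumptions (a nested-list volume indexed as volume[z][y][x], a reduction step of 2 along each axis, hypothetical function names), not the procedure used to produce the histograms.

```python
def subsample(volume, step=2):
    """Keep every step-th voxel along each axis (1/step**3 of the data)."""
    return [[row[::step] for row in plane[::step]] for plane in volume[::step]]


def block_average(volume, step=2):
    """Replace each step x step x step block by its mean gray value."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for z in range(0, nz - step + 1, step):
        plane = []
        for y in range(0, ny - step + 1, step):
            row = []
            for x in range(0, nx - step + 1, step):
                total = sum(
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(step)
                    for dy in range(step)
                    for dx in range(step)
                )
                row.append(total / step ** 3)
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    # Hypothetical 4x4x4 block of mortar (255) crossed by a one-voxel crack at x = 1.
    block = [[[0 if x == 1 else 255 for x in range(4)] for _ in range(4)] for _ in range(4)]
    print(subsample(block)[0][0])      # the crack falls between the samples
    print(block_average(block)[0][0])  # the crack survives as an intermediate gray value
```

With thin structures, subsampling either keeps a crack at full contrast or loses it entirely, while averaging turns it into intermediate gray values, so the two gray-level histograms differ markedly.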



3.3 Crack Size and Connectivity 

Figure 9 shows that the distribution of crack size is qualitatively similar at 
different thresholds. About half of the CCs are smaller than 1000 voxels, while 
the largest CC accounts for one third to one half of the total foreground volume. 
The CCs span six orders of magnitude in size. 

The presence of a huge number of tiny foreground CCs at any threshold is 
certainly suggestive of noise, but may also be a property of the material. We 
will soon obtain multiple images of the same sample to resolve this question. We 
have also noted patterns of horizontal circular artifacts caused by the reconstruction of 
irregularities in the phosphor or spread of the X-ray beam. 

Fig. 8. Grayscale histogram of reduced samples. 

Fig. 9. The empty voxel distribution versus CC volume of a 200x200x200 voxel block. 

It is possible to trace the merging of the largest components as the threshold 
changes (Figure 10). Each merger results in an abrupt increase in the volume 
and surface area of the resulting composite CC. The mergers are caused by 
the emergence of thin "bridge" cracks as the threshold is increased. A complete 
merge graph has millions of nodes. Ultimately we are interested in tracing the 
merger of cracks with increasing load. 



Fig. 10. Merge graph of the largest cracks (the numbers being the volumes of the cracks). 



3.4 Point Spread Function 

Most of the above observations can be explained using a simple model of 
digitization. We present the model in one dimension in order to be able to graph 
the functions. In the model, the cracks have a constant (one) density. The width 
of the cracks is distributed exponentially. The space between cracks is also 
distributed exponentially. The point spread function is modeled with a Gaussian. 
The 'analog' crack signal is convolved with the Gaussian, then thresholded and 
sampled. (The relative order of thresholding and sampling is immaterial.) The 
left of Figure 11 shows the original crack distribution, the convolved signal, and 
the binarized distribution at two different thresholds. As the threshold is 
decreased, nearby cracks are merged and new, thinner cracks appear. The width 
of the point spread function is more than 2 voxel diameters, as indicated by the 
cross section of cracks in the right of Figure 11. 
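
A minimal Python sketch of this one-dimensional model is given below. The mean widths, the Gaussian width of 1.5 samples, the thresholds and all function names are assumptions chosen for illustration; here a sample counts as crack when the blurred signal exceeds the threshold.

```python
import math
import random


def crack_signal(n_samples, mean_width, mean_gap, rng):
    """Alternate gaps (0) and cracks (1) with exponentially distributed lengths."""
    signal, in_crack = [], False
    while len(signal) < n_samples:
        mean = mean_width if in_crack else mean_gap
        length = max(1, round(rng.expovariate(1.0 / mean)))
        signal.extend([1.0 if in_crack else 0.0] * length)
        in_crack = not in_crack
    return signal[:n_samples]


def gaussian_kernel(sigma):
    """Normalized Gaussian point spread function truncated at 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve(signal, kernel):
    """Convolve with zero padding, keeping the length of the input."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def detected_runs(blurred, threshold):
    """Inclusive (start, end) ranges where the blurred signal exceeds the threshold."""
    runs, start = [], None
    for i, value in enumerate(blurred):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(blurred) - 1))
    return runs


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    blurred = convolve(crack_signal(2000, 3.0, 12.0, rng), gaussian_kernel(1.5))
    for threshold in (0.5, 0.3):
        print(threshold, len(detected_runs(blurred, threshold)))
```

Lowering the threshold has two competing effects on the number of detected runs: neighboring cracks fuse across narrow gaps, while cracks thinner than the point spread function start to be detected.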





Fig. 11. (left) Model of Digitization; (right) sample profiles of grayscale across larger
cracks.



4 Discussion 

The samples consist of a huge number of very thin cracks, and a few wide cracks. 
The entire volume, except for the aggregate, appears to be traversed by cracks 
(craquelure). It is possible that most of these cracks are connected, but the 
point-spread function of the imager is too large to yield convincing proof. 

A fast CC program is essential for studying the image at a wide range of
threshold settings because the large point-spread function of the imaging system,
compared to the spatial sampling interval, obscures the intrinsically high
contrast between mortar and voids. This also accounts for the rapid growth in
the volume of black pixels as the threshold is increased. For now, we can only
speculate whether the point-spread function is dominated by the phosphor or
the granularity of the reconstruction algorithm (the cooled CCD camera is not
a likely culprit).

The thin cracks have very convoluted boundaries, which result in a high 
surface-area to volume ratio. Therefore, visualization software yields very little 
insight into the structure, and a quantitative approach is required. With increas- 
ing threshold, the number of boundary voxels increases faster than the number 
of interior voxels, resulting in an overall increase in the surface-area to volume 
ratio. The results on subsampling indicate that the current spatial sampling res- 
olution is insufficient to give a true measure of the (possibly fractal) surface area 
of the crack boundaries. Such a measure is necessary to compare the increase 
in total crack surface area from load to load with theoretical predictions based 
on the loading and relaxation stress-strain curves. Although the resolution of 
the X-ray beam and of the optical system allows us to increase linear resolu- 
tion by at least a factor of four (at the cost of a 64-fold increase in acquisition 
time and data volume, and a corresponding decrease in sample volume), electron 
micrographs would be useful here. 

Under load, the volume of the largest connected components grows more 
quickly than the total volume of black pixels, as observed at any threshold. 
Equivalently, as the load increases, the volume of small connected components 
relative to the total black volume decreases because their expansion under load 
results in the small cracks being connected to the larger cracks. This effect is 
superficially similar to apparent crack growth with increasing threshold. 



5 Future Work 



We have much work ahead of us. We don’t yet have any effective measures of 
crack shape and crack topology. Before modeling crack propagation under load, 
it will be necessary to model both the cracks and the density variations in the 
material itself. Furthermore, current 3-D image registration techniques will have 
to be extended to the compound problem of bringing into correspondence objects 
exhibiting both global distortion (motion of the sample) and local changes due to 
crack growth. After separating cracks from noise specks and air holes on the basis 
of volume and surface area, we can verify that cracks grow from load to load, 
air holes remain the same, and noise specks appear and disappear randomly. 

Another important issue is the pore structure connectivity that governs water 
penetration from the surface. From the CC analysis we can determine which 
cracks open to the surface. By tracing these cracks, we can compute what fraction 
of the volume is a given distance from the surface. The change of permeability 
with crack growth affects long-term durability of concrete. An open question 
is the rate at which hairline cracks merge to form macro-cracks under load. 
Our long-term goal is the parametrization of crack growth for multiscale finite 
element modeling and analysis. 
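
The tracing of surface-connected cracks mentioned above could be approached with a breadth-first search that starts from all crack voxels lying on the faces of the volume. The sketch below is hypothetical: it assumes 6-connectivity, measures distance as the number of voxel steps along the crack, and reports the reached crack voxels as a fraction of the whole volume; none of these choices is fixed by the text.

```python
from collections import deque


def crack_depths(volume):
    """Map each surface-connected crack voxel to its path length from the surface.

    volume[z][y][x] is True for crack voxels. Cracks that do not reach a
    face of the volume are not included in the result.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])

    def on_surface(z, y, x):
        return z in (0, nz - 1) or y in (0, ny - 1) or x in (0, nx - 1)

    depth = {}
    queue = deque()
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if volume[z][y][x] and on_surface(z, y, x):
                    depth[(z, y, x)] = 0
                    queue.append((z, y, x))

    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in steps:
            n = (z + dz, y + dy, x + dx)
            if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                    and volume[n[0]][n[1]][n[2]] and n not in depth):
                depth[n] = depth[(z, y, x)] + 1
                queue.append(n)
    return depth


def fraction_within(volume, max_depth):
    """Fraction of all voxels that are crack voxels within max_depth steps of the surface."""
    total = len(volume) * len(volume[0]) * len(volume[0][0])
    depth = crack_depths(volume)
    return sum(1 for d in depth.values() if d <= max_depth) / total


# Toy 5x5x5 volume: one crack entering from a face, one isolated interior void.
n = 5
volume = [[[False] * n for _ in range(n)] for _ in range(n)]
for x in range(3):
    volume[2][2][x] = True
volume[1][1][3] = True
print(crack_depths(volume))
print(fraction_within(volume, max_depth=1))
```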

Abstract. A comparison is made of global and local methods for
the shape analysis of logos in an image database. The qualities of 
the methods are judged by using the shape signatures to define a 
similarity metric on the logos. As representatives for the two classes of 
methods, we use the negative shape method which is based on local 
shape information and a wavelet-based method which makes use of 
global information. We apply both methods to images with different 
kinds of degradations and examine how a given degradation highlights 
the strengths and shortcomings of each method. Finally, we use these 
results to combine information from both methods and develop a new 
method which is based on the relative performances of the two methods. 

Keywords: shape representation, shape recognition, image databases, 
symbol recognition, logos 



1 Introduction 

We examine three different approaches for classifying images with several com- 
ponents in an image database. One approach uses local methods to represent 
the image, the second uses global methods, while the third combines both using 
an adaptive weighting scheme based on relative performance. The local method 
uses so-called negative symbols, as described in to compute a number of 
statistical and perceptual shape features for each connected component of an 
image and its background. The global method uses a wavelet decomposition of 
the horizontal and vertical projections of the global image as described in [5].
As a sample application of well-defined multi-component images, we use logos. 

Several studies have reported results on some form of logo recognition. Each
study used either global or local methods. These include local invariants,
wavelet features, neural networks [3], and graphical distribution features.
The performance in case of certain degradations was examined.

97-12715, EIA-99-00268, and IIS-00-86162 is gratefully acknowledged.

** Currently at IBM Research Lab, Haifa 31905, Israel.

In this paper we compare the local and global methods under the influence 
of several image degradations. The performance measure is the ranking of the 
original logo after inputting a degraded version of it into the classifier. The results
exhibit the advantages and disadvantages of local methods, based on shape
features, in contrast to global methods, rooted in signal processing. Finally, we 
present an algorithm that combines both methods into a single, robust frame- 
work by adaptively weighting the contributions of each method according to an 
estimate of their relative performance. 



2 Preprocessing: Normalization of the Images 

The classification methods should be scale, translation, and rotation invariant. 
To achieve this, we apply some preprocessing steps to the input images before 
we start the computation of any features. The logos contained in the UMD- 
Logo-Database are gray-scale images that are scanned versions of black and 
white logos. Using an empirically determined preset threshold, we transform 
the input image into a binary image for which we compute its centroid. After 
shifting the image so that the centroid is located at the image center, which 
gives us translational invariance, we rotate the image around the centroid so that 
the major principal axis is aligned with the horizontal. This gives us rotational 
invariance. Finally, we resize the logo component so that its bounding box is 
a given percentage of the image size. This accounts for changes in scale of the 
input logos. These transformations make it possible to perform the following 
computations without reference to orientation, position, and scale. 
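
The normalization steps can be illustrated on a set of foreground pixel coordinates. This is a simplified sketch, not our Matlab implementation: it transforms point coordinates instead of resampling an image, it estimates the major principal axis from the second-order central moments, and `target_extent` is a hypothetical parameter standing in for the "given percentage of the image size".

```python
import math


def normalize_points(points, target_extent=1.0):
    """Translate, rotate and scale foreground coordinates.

    Returns the normalized points and the estimated orientation theta
    of the major principal axis (radians, before rotation).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Second-order central moments of the pixel positions.
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)

    # Rotate by -theta so that the major axis becomes horizontal.
    c, s = math.cos(-theta), math.sin(-theta)
    rotated = [((x - cx) * c - (y - cy) * s, (x - cx) * s + (y - cy) * c)
               for x, y in points]

    # Scale so that the longer side of the bounding box equals target_extent.
    width = max(x for x, _ in rotated) - min(x for x, _ in rotated)
    height = max(y for _, y in rotated) - min(y for _, y in rotated)
    scale = target_extent / max(width, height)
    return [(x * scale, y * scale) for x, y in rotated], theta


diagonal = [(0, 0), (1, 1), (2, 2), (3, 3)]
normalized, theta = normalize_points(diagonal)
print(theta, normalized)
```

A single-pixel component has a degenerate bounding box and would need special handling before the scaling step.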



3 The Wavelet Method 

Given a normalized image $I$, we compute the horizontal and vertical projections
of this binary image, which are defined as $P(y) = \sum_{x} I(x, y)$ and
$P(x) = \sum_{y} I(x, y)$. This means that we are counting the number of white pixels for
each column and row. Next, we use a wavelet transform to apply a low-pass
filter to the projections. In our experiments we used the Haar wavelet and the
Daubechies wavelet s8 as implemented in the MATLAB wavelet toolbox and 
described in PI . We do a 4-level Haar wavelet decomposition and for the 256x256 
images that we used we get 16 low-pass coefficients per projection. In the case
of the Haar wavelet this amounts to a repeated process of averaging and down-sampling.
Finally, we end up with a 32-dimensional vector describing the logo
as there are 16 coefficients for each of the two coordinate axes. This process is 
illustrated in Figure 1. These coefficient vectors, called signatures, are now used
to compare different logos with each other. We use the L1-norm to compute
the difference between their signatures, because the L1-norm is known to be
robust against outliers and very fast to compute.
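
The signature computation for the Haar case can be sketched as follows. The sketch assumes plain pairwise averaging, so its coefficients differ from those of an orthonormal Haar transform by a constant factor; the function names are illustrative and do not correspond to the Matlab toolbox calls we used.

```python
def projections(image):
    """Column sums and row sums of a binary image given as a list of rows."""
    columns = [sum(row[x] for row in image) for x in range(len(image[0]))]
    rows = [sum(row) for row in image]
    return columns, rows


def haar_lowpass(signal, levels):
    """Repeated pairwise averaging and down-sampling (Haar approximation)."""
    for _ in range(levels):
        signal = [(signal[i] + signal[i + 1]) / 2.0
                  for i in range(0, len(signal) - 1, 2)]
    return signal


def signature(image, levels=4):
    """Concatenated low-pass coefficients of both projections."""
    columns, rows = projections(image)
    return haar_lowpass(columns, levels) + haar_lowpass(rows, levels)


def l1_distance(a, b):
    """L1-norm of the difference between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


# A 256x256 test image containing one white rectangle.
image = [[1 if 64 <= x < 192 and 96 <= y < 160 else 0 for x in range(256)]
         for y in range(256)]
sig = signature(image)
print(len(sig))
```

For a 256x256 image, four levels reduce each 256-sample projection to 16 coefficients, which gives the 32-dimensional signature described above.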

Fig. 1. The wavelet signatures (from top-down, left-right): original image, normalized
image, horizontal projection, vertical projection, low-pass wavelet coefficients of the
horizontal projection, low-pass wavelet coefficients of the vertical projection (x-axis: index of
coefficient, y-axis: coefficient magnitude).



4 The Negative Shape Method 

The novel idea of the negative shape method as defined in for the represen- 
tation of symbol-like data such as found in logos is that we compute the shape 
features not just on the components of the foreground that constitute the symbol 
itself, but also on the components that make up the background of the image 
containing the symbol. 



4.1 Choice of Shape Features 

We start with the normalized images and do a connected component labeling of 
the image. For each component of the labeled image, we compute the following 
shape features: 

1. F1: Invariant moment: The trace of the covariance matrix of the positions
of the pixels that make up the logo, that is the sum of its diagonal entries. 

2. F2: Eccentricity: The ratio between the length and width of the axis-aligned
bounding box of the component after the normalization described in
Section 2. This gives us information about the extent of the elongation of a
component.

3. F3: Circularity: The ratio between the perimeter $P$ of the component and the
perimeter of a circle of equivalent area $A$: $CIRC = \frac{P}{2\sqrt{\pi A}}$.

4. F4: Rectangularity: The ratio between the area of the component and the 
area of its bounding box. 

5. F5: Hole Area Ratio: The ratio between the area of the holes inside the 
component and the area of the solid part of the component. 

6. F6,F7: Horizontal (Vertical) Gap Ratio: The ratio of the square of the 
gap count to the area of the component where the gap count is defined as the 
number of pixels inside the component that have a right (bottom) neighbor 
that does not belong to the component. 
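
Most of the features listed above can be computed directly from the pixel set of a component. The sketch below is illustrative only: it approximates the perimeter by the number of exposed pixel edges, takes the eccentricity as the longer bounding-box side over the shorter one, and omits the hole area ratio F5, which needs an additional flood fill. These are assumptions of the example, not a description of our implementation.

```python
import math


def shape_features(pixels):
    """Shape features of one component given as (x, y) pixel coordinates.

    y grows downwards, so the 'bottom' neighbor of (x, y) is (x, y + 1).
    """
    pixels = set(pixels)
    area = len(pixels)
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1

    # F1: trace of the covariance matrix of the pixel positions.
    mx, my = sum(xs) / area, sum(ys) / area
    trace = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in pixels) / area

    # Perimeter approximated by the number of pixel edges facing the outside.
    perimeter = sum((x + dx, y + dy) not in pixels
                    for x, y in pixels
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    # Gap counts: pixels whose right (bottom) neighbor is outside the component.
    h_gaps = sum((x + 1, y) not in pixels for x, y in pixels)
    v_gaps = sum((x, y + 1) not in pixels for x, y in pixels)

    return {
        "F1_moment": trace,
        "F2_eccentricity": max(width, height) / min(width, height),
        "F3_circularity": perimeter / (2.0 * math.sqrt(math.pi * area)),
        "F4_rectangularity": area / (width * height),
        "F6_h_gap_ratio": h_gaps ** 2 / area,
        "F7_v_gap_ratio": v_gaps ** 2 / area,
    }


rectangle = [(x, y) for x in range(4) for y in range(2)]
features = shape_features(rectangle)
print(features)
```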

4.2 The Classification Procedure 

For the negative shape method we define the distance measure between two logos
Logo_1 and Logo_2 as follows:

1. Normalize the value range for each element of the feature vector over all the
logos of all the images in the dataset.

2. For each component of Logo_1 find the component of Logo_2 that has the
smallest distance (L2-norm) in feature space to it.

3. The average of these minimal distances over all the components of Logo_1
yields a measure for the distance between the two logos.
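
The three steps above can be sketched as follows. The min-max scaling used for step 1 is an assumption of this example (the text only requires that the value ranges be normalized), and each logo is represented as a list of per-component feature vectors.

```python
import math


def normalize_features(logos):
    """Step 1: scale every feature to [0, 1] over all components of all logos."""
    dims = len(logos[0][0])
    lo = [min(v[d] for logo in logos for v in logo) for d in range(dims)]
    hi = [max(v[d] for logo in logos for v in logo) for d in range(dims)]

    def scale(v):
        return [(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
                for d in range(dims)]

    return [[scale(v) for v in logo] for logo in logos]


def logo_distance(logo1, logo2):
    """Steps 2 and 3: average, over the components of logo1, of the
    smallest L2 distance to any component of logo2."""
    return sum(min(math.dist(a, b) for b in logo2) for a in logo1) / len(logo1)


logo_a = [[0.0, 0.0], [1.0, 1.0]]
logo_b = [[0.0, 0.0], [1.0, 0.0]]
print(logo_distance(logo_a, logo_b))
```

Note that the measure as defined is not symmetric: the distance from Logo_1 to Logo_2 averages over the components of Logo_1 only.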

5 Comparison between the Methods 

All methods were implemented in Matlab and were applied to the logos
contained in the UMD-Logo-Database (123 logos) [2]. The system was tested
by providing it with an input logo and ranking the logos in the database based 
on their similarity to this logo. All methods always found the matching logo 
in the database. In particular, they ranked it first when the input logo is an 
uncorrupted version of one of the logos in the database. Below, we investigate 
the robustness of the methods when the logos are corrupted using four different 
image degradation methods as described in Figures 2, 3, 4, and 5. For each
method, we degrade the images in the database to a varying degree, input them 
into the classifier, and then examine the rank (in terms of feature space distance) 
of the original, uncompromised logo. Here we examine the median of the rankings 
of the original logo over all the input logos (part b of all the figures) and how 
often in terms of the percent of all logos the original logo was ranked among the 
closest five of all logos (part c of all the figures). Each graph consists of three 
curves: the dashed curve corresponds to the negative shape method, the gray 
curve corresponds to the wavelet method, and the solid curve corresponds to 
the combined method which has not yet been described. The combined method 
was devised based on the results of these experiments and thus we defer its 
explanation and the analysis of the results using this method to the next section 
(i.e., Section 6), once we understand the pros and cons of the two methods.
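
The two performance criteria used in parts (b) and (c) of the figures can be computed as in the following sketch. The function names are hypothetical, and ties in distance are broken by database order, which is an assumption of the example.

```python
from statistics import median


def rank_of_original(distances, original_index):
    """1-based rank of the original logo when the database is sorted by
    feature space distance to the degraded input."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return order.index(original_index) + 1


def summarize(ranks, top=5):
    """Median rank and percentage of inputs whose original is ranked in the top `top`."""
    return median(ranks), 100.0 * sum(r <= top for r in ranks) / len(ranks)


print(rank_of_original([0.3, 0.1, 0.2], original_index=0))
print(summarize([1, 2, 7, 10]))
```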



Integration of Local and Global Shape Analysis for Logo Classification 773



5.1 Additive Random Noise 

To model the image degradation that is caused by processes such as fax transmissions
or photocopying, we add Gaussian noise of zero mean and varying
standard deviation (varying from 0.1% to 50% of the maximum possible pixel
value of the image as indicated on the x-axis) to the gray-scale input images
(e.g., Figure 2a).




(a) Example Image (b) Median Rank (c) Percentage of Top 5 Rankings

Fig. 2. Gaussian Noise: The x-axis denotes the standard deviation of the Gaussian 
noise with respect to the maximal pixel value of the original image. The dashed curves 
in (b) and (c) correspond to the negative shape method, the gray curves to the wavelet 
method, and the solid curves to the combined method. 



All the methods perform very well for small amounts of noise, but the wavelet
method outperforms the negative shape method noticeably (Figures 2b and 2c) for higher
amounts of noise. Even when applying much noise (e.g., a standard deviation
which is 20% of the possible pixel value), the average rank of the original logo
under the wavelet method is close to the top 10 (Figure 2b) and about 80% of the
logos are ranked in the top 5 (Figure 2c). If we apply the negative shape method to
such a heavily degraded image, the original logo is ranked in a nearly random manner
(median rank 40th out of 123 logos as seen in Figure 2b) and the percentage of top 5
classifications is below 10% (Figure 2c).

It is to be expected that the wavelet method outperforms the negative shape 
method when adding random noise since the use of isotropic noise with an equal 
probability for adding or subtracting pixels should have only a small effect on 
the global histogram used in the wavelet method. We use noise of zero mean. 
Consequently, on the average, the distribution of white and black pixels in a 
row or column should not change much, and thus neither should the projection 
change much. On the other hand, in the negative shape method, we compute the 
feature vectors only on a small subset of pixels of each component. In this case, 
the noise will change the spatial distribution of the pixels more drastically be- 
cause of the smaller number of pixels involved. Thus the negative shape method 
is less robust towards zero-mean Gaussian noise than the wavelet method. 
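
The argument about the projections can be checked with a small experiment. The sketch below assumes gray values in [0, 1], a fixed binarization threshold of 0.5 and a synthetic square as the logo; these are illustrative choices, not the settings of our experiments.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in image]


def binarize(image, threshold=0.5):
    return [[1 if v >= threshold else 0 for v in row] for row in image]


def column_projection(binary):
    """Number of white pixels in each column."""
    return [sum(row[x] for row in binary) for x in range(len(binary[0]))]


def relative_projection_change(image, sigma, rng):
    """L1 change of the column projection, relative to the clean white-pixel count."""
    clean = column_projection(binarize(image))
    noisy = column_projection(binarize(add_gaussian_noise(image, sigma, rng)))
    return sum(abs(a - b) for a, b in zip(clean, noisy)) / sum(clean)


rng = random.Random(1)
image = [[1.0 if 16 <= x < 48 and 16 <= y < 48 else 0.0 for x in range(64)]
         for y in range(64)]
for sigma in (0.0, 0.1, 0.2):
    print(sigma, relative_projection_change(image, sigma, rng))
```

Each projection value sums over an entire row or column, so pixels flipped in opposite directions partly cancel, whereas a feature computed on a small component sees every flipped pixel.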

5.2 Reduced Resolution

To see how the methods handle differences in image resolution, which is obviously 
not offset by the scaling invariance since we work on digitized images, we reduce 
the size of the input images through sub-sampling using bilinear interpolation 
(e.g., Figure 3a). The parameter value is the size ratio between the original and
the sub-sampled image as indicated on the x-axis.




(a) Example Image (b) Median Rank (c) Percentage of Top 5 Rankings

Fig. 3. Reduced Resolution: The x-axis denotes the ratio between the size of the original
and sub-sampled images. The dashed curves in (b) and (c) correspond to the negative
shape method, the gray curves to the wavelet method, and the solid curves to the
combined method.



As in Section 5.1, the wavelet method outperforms the negative shape
method, although the negative shape method does not exhibit the same break- 
down in performance as in the case of random noise. Since we use the low-pass 
wavelet coefficients for the classifier, the reduced resolution does not influence 
the performance of the wavelet method drastically. This is because sub-sampling 
an image by bilinear interpolation has a similar effect as low-pass filtering the 
image. The low-pass wavelet coefficients of a low-pass filtered image are in gen- 
eral very similar to the low-pass coefficients computed on the original image due 
to the fact that the low frequency components of the image are not affected no- 
ticeably by the sub-sampling operation. As before, the negative shape method 
is affected by this degradation because even when large scale changes are hardly 
visible, local shape features such as circularity, rectangularity and gap ratios are 
more susceptible to local changes due to a loss of detail. 

5.3 Occlusion 

To model the occlusion of parts of a logo, we add a component to the logo image 
which in this case is a black rectangle of varying size. The parameter here is the 
percentage of the image that is occluded by the rectangle (e.g., Figure 4a).

The performance graphs show that occlusion has a greater effect on the
wavelet method than on the negative shape method (Figures 4b and 4c), although
both methods are able to handle small occlusions well.

Since the addition of an extra object or the omission of parts of the image
causes global changes to the distribution of pixels in each row or column, the 
projections are strongly affected and thus so are the wavelet coefficients. Be- 
cause of the local structure of the shape features, the components that are not 
occluded are not degraded at all and their feature values are unchanged. In the 
classifier we average the best feature vector matches for all the components in 
the input image. Since an occlusion is more likely to combine components into
larger aggregates than to break them into many new ones, the few new components
that have no corresponding component in the original image influence
the feature distance only to a small degree. Except for very degenerate
configurations, the influence of the new components is averaged out by the
continuing good matches of the feature vectors of the remaining unaffected
components.



5.4 Swirling the Image 

Swirling is a smooth deformation of an image which can be used to model a 
non-isotropic stretching of a logo. The relative position of each row is shifted 
to the left or right by an offset given by a smooth function, where the offset is 
limited to a certain percentage of the image width which is given as a parameter. 
This deforms the logo as if we stretched a rubber sheet in different directions
(e.g., Figure 5a).
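
A row-shifting deformation of this kind can be sketched as follows. The text only requires the offset to be a smooth function of the row; the sinusoid, the `periods` parameter and the zero fill at the borders are assumptions made for this example.

```python
import math


def swirl(image, max_fraction, periods=1.0, fill=0):
    """Shift every row horizontally by a smooth, bounded offset.

    max_fraction is the maximum displacement as a fraction of the image
    width (assumed to lie between 0 and 1).
    """
    height, width = len(image), len(image[0])
    out = []
    for y, row in enumerate(image):
        phase = 2.0 * math.pi * periods * y / height
        offset = int(round(max_fraction * width * math.sin(phase)))
        if offset > 0:      # shift to the right
            shifted = [fill] * offset + row[:width - offset]
        elif offset < 0:    # shift to the left
            shifted = row[-offset:] + [fill] * (-offset)
        else:
            shifted = list(row)
        out.append(shifted)
    return out


bar = [[0, 0, 0, 1, 1, 0, 0, 0] for _ in range(4)]
for row in swirl(bar, max_fraction=0.25):
    print(row)
```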

This degradation has very different effects on the two methods. The performance
of the wavelet method worsens rapidly with increasing swirl until we
basically get a random ranking (our test size is 123; therefore, an average ranking
of around 50 is nearly the expected median ranking for a logo that is not in
the database). In contrast, the median rank of the original logo when using the
negative shape method is lower than 10 (Figure 5b). It is possible to locally
approximate this deformation as a combination of translations and rotations. The
local features used by the negative shape method are rotation and translation
invariant due to the component normalization. Therefore, this method is much less
affected by this degradation than the wavelet method. Recall that the wavelet method is
only globally rotation and translation invariant due to the global preprocessing,
but not locally.

(a) Example Image (b) Median Rank (c) Percentage of Top 5 Rankings

Fig. 4. Occlusion of part of the image: The x-axis denotes the percentage of image area
that is occluded. The dashed curves in (b) and (c) correspond to the negative shape
method, the gray curves to the wavelet method, and the solid curves to the combined
method.

(a) Example Image (b) Median Rank (c) Percentage of Top 5 Rankings

Fig. 5. Swirl of the image: The x-axis denotes the maximum horizontal displacement of
an image row in percentage of image width. The dashed curves in (b) and (c) correspond
to the negative shape method, the gray curves to the wavelet method, and the solid
curves to the combined method.

6 Combination of Both Methods 

In Section 5 we saw that the wavelet and the negative shape methods perform
very differently if the input logo is corrupted by either local or global degradations.
To take advantage of the respective strengths of both methods we devised
the following performance-dependent weighting scheme. First, for each undegraded
logo I in the dataset we compute the average feature space distance of I
to all other logos for both the wavelet and the negative shape methods. This is
followed by calculating the average of these average distances for the two methods,
which we denote by $A_W$ for the wavelet method and $A_S$ for the negative
shape method. We define the ratio between these two averages (i.e., $A_W / A_S$) to be
the expected ratio E for the two methods. We determined how this ratio changed
when we applied both methods to degraded inputs. The understanding of this
relationship between the change in ratio and the relative performance of the two
methods when applied to degraded images enabled us to adaptively weight the
respective contributions of the two methods when combining them into a single
distance measure. The relative weights are based on the change in the ratio because
a large increase of the feature space distance for one method compared
to the other indicates a breakdown in its performance.

When classifying an input logo which has been degraded using one of the
processes described in Section 5, we first compute the feature distances of this
logo to all the other logos for the wavelet method, which we denote by W, and
for the negative shape method, which we denote by S. In addition, we denote
the averages of W and S over the whole dataset by $D_W$ and $D_S$, respectively.
Next, we compare the ratio between $D_W$ and $D_S$ (i.e., $D_W / D_S$) to the expected ratio
between W and S, which we assume to be similar to the precomputed value E. If
the difference in the ratios indicates that one of the two methods is performing
worse than expected, we decrease its weight in the final classification and increase
the weight of the other method. The combined feature distance C for a single
degraded input logo is a weighted sum of the wavelet method feature distance
W and negative shape method feature distance S:



$$C = \frac{E \cdot D_S}{D_W} \cdot \frac{W}{E} + \frac{D_W}{E \cdot D_S} \cdot S \qquad (1)$$



The factor E, which describes the average ratio between W and S, is only included
in order to facilitate understanding the rationale behind the final weighting
method. If we divide W by E, then we effectively normalize W, so that
its magnitude is equal to the magnitude of S. Thus, if the ratio $D_W / D_S$ equals the
expected ratio E, then we believe that both methods will perform well and we
use an approximately equal weighting of the two feature distances W and S. If
now the ratio $D_W / D_S$ either grows larger (smaller) than E because the degradation
of the input logo causes the wavelet method to compute feature distances larger
(smaller) than the negative shape method (up to the expected ratio E), then
the contribution of W in equation (1) will be reduced (increased) because we have
less (more) confidence in the wavelet method’s ability to classify the input logo
correctly.
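
The behavior described in this paragraph admits the following reading, given here only as a hedged sketch. The exact form of the two weights is an assumption of the example: they are chosen so that both equal one when the observed ratio of average distances matches the expected ratio E, and so that the weight of the normalized wavelet distance falls, and the weight of the negative shape distance rises, as the observed ratio grows beyond E. The names `d_w` and `d_s` stand for the averages of W and S over the dataset.

```python
def expected_ratio(avg_wavelet, avg_shape):
    """Expected ratio E from the average distances between undegraded logos."""
    return avg_wavelet / avg_shape


def combined_distance(w, s, d_w, d_s, e):
    """Weighted sum of the wavelet distance w and the negative shape distance s.

    d_w, d_s: average distances of the degraded input to all database logos.
    e: expected ratio between wavelet and negative shape distances.
    """
    ratio = d_w / d_s
    weight_w = e / ratio   # 1 if ratio == e; shrinks when wavelet distances inflate
    weight_s = ratio / e   # 1 if ratio == e; grows when wavelet distances inflate
    return weight_w * (w / e) + weight_s * s


# Observed ratio equals the expected ratio: equal weighting.
print(combined_distance(w=4.0, s=2.0, d_w=6.0, d_s=3.0, e=2.0))
# Wavelet distances twice as large as expected: its weight is halved.
print(combined_distance(w=4.0, s=2.0, d_w=12.0, d_s=3.0, e=2.0))
```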

This adaptive weighting scheme increases the robustness of the classification
noticeably. When we examine the performance criteria in Section 5, we see that
the combined method is able to capture the different behavior of the methods
and adapts its weights accordingly. Comparing the performance of the combined
method on images degraded as described in Section 5.1 and Section 5.4, where
the wavelet and the negative shape method exhibit very different performances,
we see that our weighting scheme is able to detect the change in relative performance
and adjust the weights to mimic the classification of the better performing
method. For the degradations described in Sections 5.2 and 5.3, where the
performance difference between the two basic methods is not as pronounced, the
combined method lags slightly behind the better performing method in the median
rank criterion (part (b) of all the Figures), but equals or surpasses the
performance of the better method in terms of the other criterion (part (c) of
all the Figures). This shows that our combined scheme is effective in capturing
global as well as local shape information and is thus able to deal well with the
image degradations of the kind that we described.



7 Summary and Future Work 

Both the wavelet as well as the negative shape method are well-suited for cer- 
tain kinds of image degradations but are very sensitive to others. This discrep- 
ancy in performance can be explained by the difference between local shape 
feature-based and global, filter-based methods. On the one hand, we have the 
wavelet method that operates on the global image and computes features that 
are relatively invariant to degradations that are isotropic. On the other hand,
we have the negative shape method which operates on local image regions. Thus 
its features are relatively invariant to changes that leave the image at other lo- 
cations mostly intact such as occlusions or preserve the local image structure 
such as the swirl deformation. We take advantage of the fact that both basic 
methods perform very differently on images that exhibit degradations of either 
local or global nature by devising a performance-dependent weighting scheme 
that combines the results of both methods. Our combined algorithm shows a 
noticeable improvement in the robustness of the classification by combining the 
strengths and avoiding the weaknesses of the respective methods. The more the
performances of the underlying methods differ, the better this weighting scheme
performs, because a large difference makes it easier to detect whether one method
is performing poorly with respect to the other. Therefore, the wavelet and the negative
shape methods are very well-suited to be combined by a performance-dependent
weighting scheme.

In future work we plan to improve the synergy between the two methods
by using local image information to estimate how much an image region is
degraded and then using this locality information to adaptively weight the feature
vectors on the component level.

Abstract. In this paper, we present our results for tracking the motion of
animals in stabling. This system was used in order to record the behavior of
pigs in stabling. We used an object-oriented method for our application instead 
of a block-oriented method. First of all, we calculated a reference image. This 
image was used in order to separate the objects from the background. Then, 
object pixels were grouped into an object by the line-coincidence method. 
Movement parameters are calculated for each object. Finally, an object 
correction is done for those objects that were occluded by the boundary of the 
stabling. The resulting tracking path and the movement parameters are 
displayed on screen for the user. 



1 Introduction 

Researchers and farmers record the movements of animals on video in order to 
observe animal behavior. Afterwards, they analyze these videos by looking up each 
image sequence and taking notes about the spatial position of each animal. This is a 
very time consuming process. However, such an observation of animals is important 
in order to understand the behavior of animals in stabling and other environments 
[1][2]. The resulting knowledge can help to improve animal welfare as well as meat 
quality. 

The task of behavior analysis is not only important for the study of animal welfare, 
it also becomes important for many other tasks such as group behavior analysis in 
public traffic areas, soccer game reporting and pharmacological studies. 

An automatic system has to recognize and track the object before it is
possible to describe semantic concepts of the behavior of the observed object, such
as "stands" or "moves", or more complex concepts such as "nervous
excitement". In this paper, we present our results for motion tracking of pigs in
stabling.

Visual object tracking has become an important subject in computer vision. We
can identify block-oriented and object-oriented methods [3]. An object-oriented
method is described in Mae et al. [4]. They combined the optical flow with edge
detection in order to separate objects from background. The intention of this work is
contour detection of moving objects in a highly structured background. In Hoetner et
al. [5] a block-oriented approach is described that uses the signal difference and the
texture for the detection of moving objects. In Iketani et al. [6] the segmentation of
the images is done by partitioning the image into blocks and determining the value of the
optical flow for each block. Regions with the same value for the optical flow were
grouped into one object. That allows the detection of objects on a non-stationary
background. For our application, we chose an object-oriented method to prevent
objects from being split or merged by the blocks.



2 Image Acquisition 

The movements of the pigs in stabling were recorded by a video camera when the 
pigs came into stabling for the first time. The length of the video was from 20s to one 
minute. The camera had a fixed position that allowed us to look at the stabling 
diagonally from the top. In each video, we could see the boundary of the stabling. 
First, images from the empty stabling were taken for 0.5 s. The spatial resolution of 
the image sequence is 768 x 576 pixels for each image. It is possible to see the time in 
each video at the lower right corner. 



3 Outline of Our Method 

The outline of our method is shown in Figure 1. The image sequence of the empty 
stabling is extracted from the video and given to the pre-processing unit. This unit 
calculated a reference image, a threshold, and a new matrix which contains the boundary
of the stabling. Afterwards, objects and background were separated in each image of 
the video sequence based on the reference image. Then, objects were determined in 
the image and the motion parameters were calculated. Objects may be occluded by 
the boundary of the stabling. The objects and motion parameters were corrected for 
this reason. Finally, the tracking path and the motion parameters were displayed on 
screen. 



4 Image Pre-processing 

The image sequence from the empty stabling (see Figure 2) was extracted from the 
video sequence and a reference background image was calculated from these images: 

$\{anf_k(i,j) \mid k = 1, \ldots, K\}, \qquad ref(i,j) = \frac{1}{K} \sum_{k=1}^{K} anf_k(i,j). \qquad (1)$



In addition, we extracted the boundary of the stabling from the reference
image ref(i,j) and stored it in an image matrix called gitter(i,j). To do so, we
calculated the histogram of ref(i,j) and determined the grey level threshold which
allowed us to separate the background from the boundary of the stabling. A pixel in
the matrix gitter(i,j) is one if the pixel belongs to the boundary of the stabling and
zero if it does not.
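
The pre-processing step can be illustrated with a minimal sketch in plain Python. It assumes that the reference image is the pixel-wise mean of the K empty-stabling frames, as in Eq. (1), and that the boundary is darker than the floor, so that a single grey level threshold separates the two. The function names `reference_image` and `boundary_mask`, the threshold value and the toy frames are illustrative assumptions, not part of the original system.

```python
def reference_image(frames):
    """Pixel-wise mean of K equally sized grey-level frames (cf. Eq. 1)."""
    k = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[i][j] for f in frames) / k for j in range(cols)]
            for i in range(rows)]


def boundary_mask(ref, grey_threshold):
    """gitter(i,j): 1 for boundary pixels, 0 otherwise.

    Assumption: boundary pixels are darker than the threshold.
    """
    return [[1 if value < grey_threshold else 0 for value in row]
            for row in ref]


# Two toy 2 x 3 frames of the empty stabling; the last column is a dark bar.
frames = [
    [[100, 102, 20], [98, 100, 22]],
    [[102, 100, 22], [100, 102, 20]],
]
ref = reference_image(frames)
gitter = boundary_mask(ref, grey_threshold=60)
print(ref)
print(gitter)
```

In practice the threshold would be read from the histogram of ref(i,j) rather than fixed by hand.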
Fig. 1. Overall Structure of our Algorithm



5 Separation of Object and Background 

A threshold was determined from the initial images and the reference image in order
to separate the objects from the background. We determined the variance from the
difference between the initial images and the reference image:

$$sq(i,j) = \sum_{k=1}^{K} \left( anf_k(i,j) - ref(i,j) \right)^2 \qquad (2)$$

Afterwards, we calculated the histogram h(sq) over all difference pixels. The threshold
for the object segmentation was determined as:

$$\sum_{sq=0}^{grenz} h(sq) = 0.95, \qquad thresh = 2 \cdot grenz \qquad (3)$$

Next, the reference image ref(i,j) was subtracted from the current image act(i,j). The
resulting image was segmented into object pixels and background pixels based on the
threshold described above. The resulting binary image is called arb(i,j). The
object pixels were grouped into separate objects by the line coincidence method [7].
Each object was labelled with an object number objnr. Objects smaller than a
predefined threshold were interpreted as image noise and eliminated from the list of
objects.
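
The segmentation and grouping steps can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold is passed in as a given value, and a 4-connected flood fill stands in for the line coincidence method [7], which is not reproduced here. The names `segment`, `label_objects` and `min_size` are hypothetical.

```python
from collections import deque


def segment(act, ref, thresh):
    """arb(i,j): 1 where |act - ref| exceeds thresh, else 0."""
    return [[1 if abs(a - r) > thresh else 0 for a, r in zip(row_a, row_r)]
            for row_a, row_r in zip(act, ref)]


def label_objects(arb, min_size):
    """Group object pixels into 4-connected components by flood fill.

    Returns {objnr: [(i, j), ...]}; components with fewer than
    min_size pixels are treated as noise and dropped.
    """
    rows, cols = len(arb), len(arb[0])
    seen = [[False] * cols for _ in range(rows)]
    objects, objnr = {}, 0
    for i in range(rows):
        for j in range(cols):
            if arb[i][j] != 1 or seen[i][j]:
                continue
            queue, pixels = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and arb[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_size:
                objnr += 1
                objects[objnr] = pixels
    return objects


# Toy example: one 2 x 2 object and one isolated noise pixel.
ref = [[10] * 5 for _ in range(4)]
act = [
    [10, 60, 60, 10, 10],
    [10, 60, 60, 10, 55],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
arb = segment(act, ref, thresh=20)
objects = label_objects(arb, min_size=2)
print(objects)
```

The isolated pixel at (1, 4) is removed by the size filter, so only one object remains.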
Fig. 2. Empty stabling

Once the first real object had been found, the determination of the object position
and the movement vector was started.



6 Determination of the Object Position and the Movement Vector 

The center of gravity (mi, mj) was determined for each object. Then, the movement
vector of each object was determined by the following equations:

bewi(objnr, k) = mi(objnr, k + n + 1) - mi(objnr, k)
bewj(objnr, k) = mj(objnr, k + n + 1) - mj(objnr, k)     (4)

Unfortunately, this method for the determination of the movement vector has some
disadvantages. It cannot determine rotations of the object about its own axis,
nor can it determine 3D movements. However, it was sufficient for our problem.

Another parameter that is determined is the area anz of each object at each time t.
That means that for each object we determined objnr, mi, mj, bewi, bewj and anz.
These parameters are the basis for the further processing.
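
The object parameters described in this section can be computed as in the following sketch. It assumes each object is given as a list of (i, j) pixel coordinates per frame; the names `centroid` and `movement_vector` and the default frame offset n = 0 are illustrative assumptions.

```python
def centroid(pixels):
    """Center of gravity (mi, mj) of a list of (i, j) pixel coordinates."""
    n = len(pixels)
    mi = sum(i for i, _ in pixels) / n
    mj = sum(j for _, j in pixels) / n
    return mi, mj


def movement_vector(track, k, n=0):
    """(bewi, bewj): centroid displacement between frame k and frame k + n + 1."""
    mi0, mj0 = track[k]
    mi1, mj1 = track[k + n + 1]
    return mi1 - mi0, mj1 - mj0


# One object observed in two consecutive frames, shifted two pixels to the right.
object_frames = [
    [(2, 2), (2, 3), (3, 2), (3, 3)],
    [(2, 4), (2, 5), (3, 4), (3, 5)],
]
track = [centroid(p) for p in object_frames]
anz = [len(p) for p in object_frames]      # object area per frame
bew = movement_vector(track, k=0)
print(track, anz, bew)
```

Because only the centroid is tracked, a pig that turns on the spot produces a near-zero movement vector, which matches the limitation stated above.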



7 Correction of the Objects and the Object Parameters 

The boundary of the stabling can sometimes partially occlude the pigs. For instance,
an object may be split into parts by the boundary. The computed movement parameters
were used to correct these disturbances and deformations. The binary matrix arb(i,j,k) at
time k was used to check, for each object, whether the object was behind or in
front of the boundary. If the object was in front of the boundary, it occluded
the boundary and no correction was necessary. A correction was necessary only if
the object was behind the boundary. For this purpose, the
object was extracted from arb(i,j) so that we obtained a new matrix arbl(i,j) that
contained only the pixels of this object. The matrix arbl(i,j) was added to the matrix
gitter(i,j). If the resulting matrix contained the value “2”, no
correction was necessary. Otherwise, a dilation was applied to
arbl(i,j):

$$arbl(i,j) \oplus M = \{ (i,j) : M_{i,j} \cap arbl(i,j) \neq \emptyset \} \qquad (5)$$

M is a 3 x 3 mask containing the value “1” and $M_{i,j}$ is the mask M shifted
to the pixel (i,j). The resulting matrix was combined with the mask gitter(i,j) by the
logical AND function. If the result was zero, the object was not occluded by the
boundary of the stabling and no correction was necessary. A correction was made if
the result of this operation contained the value “1”. In that case, the object was
isolated from the matrix arb(i,j,k-1). Then, it was shifted by the calculated motion
vector (= trans(i,j)) and combined with the matrix arbl(i,j) by the logical AND function.
We then checked whether the resulting corrected object had a nonempty intersection
with the object in arb(i,j,k); if so, these objects were combined into a single object.
Afterwards, the parameters mi, mj, bewi, bewj, and anz were calculated for
the corrected object.

Fig. 3. Original image (first pig comes into the stabling) and result of motion analysis
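
The two occlusion tests described in this section can be sketched as follows: the overlap test (a value of "2" in the sum of the object matrix and gitter) and the adjacency test (dilation with a 3 x 3 mask followed by a logical AND with gitter). This is a minimal illustration on toy matrices; the function names are hypothetical, and the subsequent shift-and-merge correction is not shown.

```python
def dilate3x3(img):
    """Binary dilation with a 3 x 3 mask of ones (cf. Eq. 5)."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < rows and 0 <= x < cols and img[y][x] == 1:
                        out[i][j] = 1
    return out


def overlaps_boundary(obj, gitter):
    """True if obj + gitter contains a 2, i.e. the object covers boundary pixels."""
    return any(o + g == 2 for ro, rg in zip(obj, gitter) for o, g in zip(ro, rg))


def touches_boundary(obj, gitter):
    """True if the dilated object ANDed with gitter is non-zero."""
    dilated = dilate3x3(obj)
    return any(d & g for rd, rg in zip(dilated, gitter) for d, g in zip(rd, rg))


obj = [[1, 1, 0, 0, 0],
       [1, 1, 0, 0, 0]]
gitter_near = [[0, 0, 1, 0, 0],
               [0, 0, 1, 0, 0]]
gitter_far = [[0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1]]

needs_correction_near = (not overlaps_boundary(obj, gitter_near)
                         and touches_boundary(obj, gitter_near))
needs_correction_far = (not overlaps_boundary(obj, gitter_far)
                        and touches_boundary(obj, gitter_far))
print(needs_correction_near, needs_correction_far)
```

An object that ends directly next to a boundary bar is flagged for correction, whereas an object far from the bar is not.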




Fig. 4. Image number 82 and tracking path of pigs



8 Output of the System

The resulting tracking path of each object is displayed on screen. Figures 3-8 show
images at different times t and the tracked paths of the objects. They show that our
algorithm is able to track the objects.

However, the recorded paths show that the movement of a pig is not a
smooth line. The pigs step back and forth a little in the same place and rotate
about their own axis.

Our system records the coordinates, the motion parameters and the time for each pig.
This gives us the basic information needed for behavior analysis. The next step must be
the mapping of the information extracted from the images to the semantic concepts that
the veterinarian needs for the analysis. Only when this task is solved will we have a
fully automatic system. However, our work also shows the complexity of the task of
behavior analysis. Tracking the objects is not simple, and it is even harder in a
real-world environment. The next step, the mapping of the image information to the
semantic concepts, requires a clear understanding of what the concepts are and how we
can describe them by the image content.

So far, the veterinarians have been satisfied with a listing of the
coordinates, the motion parameters and the associated time for each pig.




Fig. 6. Image number 118 and tracking path of pigs



9 Conclusion

We have presented our system for motion tracking of animals in stabling. Our system
was used for the analysis of the movements of pigs when they enter new stabling.

Furthermore, we have investigated other methods for motion tracking. However,
we found that our method has several advantages over these methods for our
application. First of all, it is easy to calculate. Secondly, it can correct occluded
object parts, which helps to improve the determination of the object position and the
motion parameters. Finally, it takes into account the changing shape of the objects
that is caused, for example, by the rotation of an object.




Fig. 7. Image number 205 and tracking path of pigs 




Fig. 8. Image number 327 and tracking path of pigs 
Abstract. This paper describes a practical system developed for generating
3D models of human heads from silhouettes alone. The input to
the system is an image sequence acquired from circular motion. Both the
camera motion and the 3D structure of the head are estimated using
silhouettes which are tracked throughout the sequence. Special properties
of the camera motion and their relationships with the intrinsic parameters
of the camera are exploited to provide a simple parameterization of
the fundamental matrix relating any pair of views in the sequence. Such
a parameterization greatly reduces the dimension of the search space for
the optimization problem. In contrast to previous methods, this work can
cope with incomplete circular motion and more widely spaced images.
Experiments on real image sequences are carried out, showing accurate
recovery of 3D shapes.



1 Introduction 

The reconstruction of 3D head models has many important applications, such as
video conferencing, model-based tracking, entertainment and face modeling [?].
Existing commercial methods for acquiring such models, such as laser scans, are
expensive, time-consuming and cannot cope with low-reflectance surfaces. Image-based
systems can easily overcome these difficulties by tracking point features
along video sequences [?]. However, this can be remarkably difficult for human
faces, where there are not many reliable landmarks with a long life span along the
sequence.

In this paper we present a practical system for generating 3D head models
from silhouettes alone. Silhouettes are comparatively easy to track and
provide useful information for estimating the camera motion [?] and for
reconstruction [?]. Since they tend to concentrate around regions of high curvature,
they provide a compact way of parameterizing the reconstructed surface. In our
system, images are acquired by moving the camera along a circular path around
the head. This imposes constraints on the fundamental matrix relating each
pair of images, simplifying the motion estimation. The system does not require
the motion to be a full rotation and the images can be acquired at more widely
spaced positions around the subject, an advantage over the technique introduced
in [10].
Section 2 presents the theoretical background of motion estimation from
silhouettes. The algorithms for model building are described in Section 3 and
Section 4 shows the experimental results. Conclusions are given in Section 5.



2 Theoretical Background 

The fundamental difficulty in solving the problem of structure and motion from
silhouettes is that, unlike point or line features, the silhouettes do not readily
provide image correspondences that allow for the computation of the epipolar
geometry, summarized by the fundamental matrix. The usual solution to this
problem is the use of epipolar tangencies [?], as shown in Fig. 1. An epipolar
tangent point is the projection of a frontier point [?], which is the intersection of
two contour generators. If 7 or more epipolar tangent points are available, the
epipolar geometry can be estimated. The intrinsic parameters of the cameras can
then be used to recover the motion [?]. However, the unrealistic demand for a
large number of epipolar tangent points makes this approach impractical. By
constraining the motion to be circular, a parameterization of the fundamental
matrix with only 6 degrees of freedom (dof) is possible [?]. This parameterization
explicitly takes into account the main image features of circular motion,
namely the image of the rotation axis, the horizon and a special vanishing point,
which are fixed throughout the sequence. This makes it possible to estimate the
epipolar geometry by using only 2 epipolar tangencies [?].

Fig. 1. A frontier point is the intersection of two contour generators and is visible in
both views. The frontier point projects onto a point on the silhouette which is also on
an epipolar tangent

In [10], a practical algorithm has been introduced for the estimation of motion
and structure from silhouettes of a rotating object. The image of the rotation
axis and the vanishing point are first determined by estimating the harmonic
homology associated with the image of the surface of revolution spanned by the
object. In order to obtain such an image, a dense image sequence from a complete
circular motion is required. In this paper, the parameters of the harmonic homology
and other motion parameters are estimated simultaneously by minimizing
the reprojection errors of epipolar tangents. This algorithm does not require the
image of such a surface of revolution and thus can cope with incomplete circular
motion and more widely spaced images, an advantage over the algorithm
presented in [10].



2.1 Symmetry and Epipolar Geometry in Circular Motion 

Consider a pinhole camera undergoing circular motion. If the camera intrinsic
parameters are kept constant, the projection of the rotation axis will be a line $l_s$
which is pointwise fixed on each image. This means that, for any point $x$ on $l_s$,
the equation $x^T F x = 0$ is satisfied, where $F$ is the fundamental matrix related
to any image pair in the sequence. For circular motion, all the camera centers
lie on a common plane. The image of this plane is a special line $l_h$, the horizon.
Since the epipoles are the images of the camera centers, they must lie on $l_h$. In
general, $l_s$ and $l_h$ are not orthogonal. Another feature of interest is the vanishing
point $v_x$, which corresponds to the normal direction of the plane defined by the
camera center and the axis of rotation. The vanishing point and the horizon
satisfy $v_x^T l_h = 0$. A detailed discussion of the above can be found in [?].

Consider now a pair of cameras, denoted as $P_1$ and $P_2$, related by a rotation
about an axis not passing through their centers, and let $F$ be the fundamental
matrix associated with this pair. It has been shown that corresponding epipolar
lines associated with $F$ are related to each other by a harmonic homology $W$ [?],
given by

$$W = I - 2 \frac{v_x l_s^T}{v_x^T l_s} \qquad (1)$$

Note that $W$ has 4 dof: 2 corresponding to the axis and 2 corresponding to the
vanishing point. If $P_1$ and $P_2$ point towards the axis of rotation, $v_x$ will be at
infinity and $W$ will be reduced to a skew symmetry $S$ with only 3 dof. Besides,
if the cameras also have zero skew and aspect ratio 1, the transformation will
be further specialized to a bilateral symmetry $B$ with only 2 dof. A pictorial
description of these transformations can be seen in Fig. 2.

In [?], an algorithm has been presented for estimating the camera intrinsic
parameters from 2 or more silhouettes of surfaces of revolution. For each
silhouette, the associated harmonic homology $W$ is estimated and this provides 2
constraints on the camera intrinsic parameters:

$$v_x = K K^T l_s \qquad (2)$$

where $K$ is the 3 x 3 camera calibration matrix. Conversely, if the camera intrinsic
parameters are known, (2) provides 2 constraints on $W$ and as a result $W$ has
only 2 dof.
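
The harmonic homology of Eq. (1) can be checked numerically with a short sketch in plain Python. It builds $W$ from a vanishing point and an axis given as homogeneous 3-vectors, and verifies two properties that follow directly from the formula: $W$ is an involution ($WW = I$), and points on the axis are left fixed. The numeric values are arbitrary illustrative choices, not data from the experiments, and the helper names are hypothetical.

```python
def harmonic_homology(v_x, l_s):
    """W = I - 2 v_x l_s^T / (v_x^T l_s) for homogeneous 3-vectors."""
    s = sum(a * b for a, b in zip(v_x, l_s))
    return [[(1.0 if r == c else 0.0) - 2.0 * v_x[r] * l_s[c] / s
             for c in range(3)] for r in range(3)]


def matmul(a, b):
    """Product of two 3 x 3 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def apply(m, x):
    """Product of a 3 x 3 matrix with a 3-vector."""
    return [sum(m[r][k] * x[k] for k in range(3)) for r in range(3)]


# Special case: axis x = 0 and vanishing point at infinity along x.
# W reduces to the reflection diag(-1, 1, 1), a bilateral symmetry.
B = harmonic_homology([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])

# General case with an arbitrary axis and vanishing point.
W = harmonic_homology([3.0, 1.0, 1.0], [1.0, 0.2, -2.0])
WW = matmul(W, W)                      # close to the identity
fixed = apply(W, [2.0, 0.0, 1.0])      # point on the axis: 2*1 + 0*0.2 - 2*1 = 0
print(B)
print(WW)
print(fixed)
```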
Fig. 2. (a) A curve displaying bilateral symmetry. The horizon is orthogonal to the
axis. (b) Same curve, distorted by an affine transformation. The horizon is no longer
orthogonal to the axis, and each side of the curve is mapped to the other by a skew
symmetry transformation. (c) The curve is now distorted by a special projective
transformation (harmonic homology), and the lines of symmetry intersect at a point
corresponding to the vanishing point



2.2 Parameterization of the Fundamental Matrix

In [8,17], it has been shown that any fundamental matrix $F$ can be parameterized
as $F = [e_2]_\times M$, where $M^{-T}$ is any matrix that maps the epipolar lines from
one image to the other, and $e_2$ is the epipole in the second image. In the special
case of circular motion, it follows that

$$F = [e_2]_\times W \qquad (3)$$

Note that $F$ has 6 dof: 2 to fix $e_2$, and 4 to determine $W$. From (2), if the camera
intrinsic parameters are known, 2 parameters are enough to define $W$ and thus
$F$ will have only 4 dof.

An alternative parameterization for the fundamental matrix in the case of
circular motion [?] is given by

$$F = [v_x]_\times + \kappa \tan\frac{\theta}{2} \left( l_s l_h^T + l_h l_s^T \right) \qquad (4)$$

where $\theta$ is the angle of rotation between the cameras. The constant $\kappa$ can be
determined from the camera intrinsic parameters [?] if $l_s$ and $l_h$ are properly
normalized. $\theta$ is the only parameter which depends on the particular pair of
cameras being considered, while the other 4 terms are common to all pairs of
images in the sequence. When the camera intrinsic parameters are known, 2
parameters are enough to fix $l_s$ and $v_x$. Since $v_x$ must lie on $l_h$, only 1 further
parameter is needed to fix $l_h$. As a result, the fundamental matrix has only 4
dof.
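
As a small numerical illustration of Eq. (4), the following sketch builds $F$ for a toy configuration and checks the property stated in Section 2.1 that every point $x$ on the image of the rotation axis satisfies $x^T F x = 0$. The configuration (axis $x = 0$, horizon $y = 0$, vanishing point at infinity along the x direction) and the value $\kappa = 1$ are assumptions chosen purely for readability; they are not calibrated values from the experiments.

```python
import math


def skew(v):
    """Cross-product matrix [v]_x of a 3-vector."""
    return [[0.0, -v[2], v[1]],
            [v[2], 0.0, -v[0]],
            [-v[1], v[0], 0.0]]


def circular_motion_F(v_x, l_s, l_h, theta, kappa=1.0):
    """F = [v_x]_x + kappa * tan(theta / 2) * (l_s l_h^T + l_h l_s^T)."""
    t = kappa * math.tan(theta / 2.0)
    s = skew(v_x)
    return [[s[r][c] + t * (l_s[r] * l_h[c] + l_h[r] * l_s[c])
             for c in range(3)] for r in range(3)]


def quad_form(F, x):
    """x^T F x for a homogeneous image point x."""
    return sum(x[r] * F[r][c] * x[c] for r in range(3) for c in range(3))


def det3(m):
    """Determinant of a 3 x 3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


l_s = [1.0, 0.0, 0.0]    # image of the rotation axis: the line x = 0
l_h = [0.0, 1.0, 0.0]    # horizon: the line y = 0
v_x = [1.0, 0.0, 0.0]    # point at infinity along x; it lies on l_h
theta = math.radians(20.0)
F = circular_motion_F(v_x, l_s, l_h, theta)

# Points (0, y, 1) lie on l_s, so x^T F x should vanish for each of them.
on_axis = [quad_form(F, [0.0, y, 1.0]) for y in (-2.0, 0.5, 3.0)]
print(on_axis, det3(F))
```

In this toy configuration $F$ has rank 2, and its right null vector $(1, 0, \tan(\theta/2))$ lies on the horizon, consistent with the statement that the epipoles must lie on $l_h$.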

3 Algorithms 

Before the 3D model can be reconstructed from the silhouettes of the head, the
motion of the camera has to be estimated. By using the parameterization shown
in (4), the $\binom{N}{2} = N(N-1)/2$ fundamental matrices relating all possible pairs
of cameras in a sequence of $N$ images, taken by a rotating camera with known
intrinsic parameters, can be defined with the 3 parameters which fix $l_s$, $v_x$ and
$l_h$, together with the $N-1$ angles of rotation between adjacent cameras. By
enforcing the epipolar constraint on the corresponding epipolar tangent points,
these $N+2$ motion parameters can be estimated by minimizing the reprojection
errors of corresponding epipolar tangents (see Fig. 3). Since a silhouette has
at least two epipolar tangencies (one at its top and another at its bottom),
there will be a total of $2\binom{N}{2} = N(N-1)$ measurements from all pairs of images.
Due to the dependence between the associated fundamental matrices, however,
these $N(N-1)$ measurements only provide $2N$ (or 2 when $N = 2$) independent
constraints on the $N+2$ parameters. As a result, a solution will be possible if
$N \geq 3$.
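
The counting argument above can be tabulated with a few lines of Python. The function simply restates the numbers given in the text (3 shared parameters plus $N-1$ angles, two tangencies per image pair, and $2N$ independent constraints, or 2 when $N = 2$); the function name is hypothetical.

```python
def motion_problem_counts(n_images):
    """Return (measurements, parameters, independent constraints) for N images."""
    pairs = n_images * (n_images - 1) // 2
    measurements = 2 * pairs                  # two epipolar tangencies per pair
    parameters = 3 + (n_images - 1)           # l_s, v_x, l_h plus N - 1 angles
    constraints = 2 if n_images == 2 else 2 * n_images
    return measurements, parameters, constraints


for n in (2, 3, 10):
    measurements, parameters, constraints = motion_problem_counts(n)
    solvable = constraints >= parameters
    print(n, measurements, parameters, constraints, solvable)
```

For $N = 2$ there are 2 constraints on 4 parameters, so the problem is underdetermined; for $N = 3$ there are 6 constraints on 5 parameters; for the 10-image sequences used in Section 4 there are 20 independent constraints on 12 parameters.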




Fig. 3. The parameters of the fundamental matrix associated with each pair of images 
in the sequence can be estimated from the reprojection errors of epipolar tangents. The 
solid lines are tangents to the silhouettes passing through the epipoles, and the dashed 
lines are the epipolar lines corresponding to the tangent points 



The minimization of the reprojection errors will generate a consistent set
of fundamental matrices, which, together with the camera intrinsic parameters,
can be decomposed into a set of camera matrices describing a circular motion
compatible with the image sequence. The algorithm for motion estimation is
summarized in Algorithm 1. Having the motion of the camera estimated, the 3D
model can then be reconstructed from the silhouettes using the simple
triangulation technique introduced in [?].
Algorithm 1 Estimation of the motion parameters from silhouettes
  track the silhouettes of the head using cubic B-spline snakes;
  initialize $l_s$, $l_h$ and the N - 1 angles between the N cameras;
  while not converged do
    for each image pair do
      form fundamental matrix;
      locate epipolar tangents;
      compute reprojection errors;
    end for
    update parameters to minimize the sum of reprojection errors;
  end while
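
Two building blocks of the inner loop of Algorithm 1 can be sketched as follows: locating the epipolar tangent points on a silhouette and measuring a reprojection error as a point-to-line distance. This is a simplified illustration under stated assumptions, not the authors' implementation: the silhouette is a polygon instead of a B-spline snake, the epipole is finite and lies outside the silhouette, and the toy fundamental matrix corresponds to a pure sideways translation rather than circular motion. All function names are hypothetical.

```python
import math


def tangent_points(epipole, contour):
    """Contour vertices touched by the two tangent lines through the epipole.

    Assumes a finite epipole outside the convex hull of the contour, so the
    contour subtends less than 180 degrees as seen from the epipole.
    """
    ex, ey = epipole
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    d0x, d0y = cx - ex, cy - ey            # direction towards the contour center

    def angle(p):
        dx, dy = p[0] - ex, p[1] - ey
        return math.atan2(d0x * dy - d0y * dx, d0x * dx + d0y * dy)

    return min(contour, key=angle), max(contour, key=angle)


def epipolar_line(F, p):
    """Line F x for the image point p = (x, y), as (a, b, c) with ax + by + c = 0."""
    x = (p[0], p[1], 1.0)
    return tuple(sum(F[r][k] * x[k] for k in range(3)) for r in range(3))


def point_line_distance(p, line):
    """Euclidean distance from the image point p to the line (a, b, c)."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / math.hypot(a, b)


square = [(1.0, -1.0), (3.0, -1.0), (3.0, 1.0), (1.0, 1.0)]
lower, upper = tangent_points((0.0, 0.0), square)

# Toy F for a pure translation along x: epipolar lines are horizontal.
F_toy = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
line = epipolar_line(F_toy, (5.0, 2.0))            # the line y = 2
error = point_line_distance((7.0, 2.5), line)      # tangent point in the other view
print(lower, upper, line, error)
```

Summing such distances over all tangencies and image pairs gives a cost that a nonlinear least-squares routine can minimize over the motion parameters.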



4 Experiments and Results 

In order to evaluate the performance of the algorithm described in Section 3,
2 human head sequences, each with 10 images, were acquired using the setup
shown in Fig. 4. The camera is mounted on the extensible rotating arm of the
tripod, whose height can be adjusted according to the height of the subject. Each
image in the sequence was taken after rotating the arm of the tripod by roughly
20°, with the subject standing close to the tripod. The intrinsic parameters of
the camera are obtained from an offline calibration process. The silhouettes of
the heads are tracked using cubic B-spline snakes [2] (see Figs. 5 and 6).




Fig. 4. Experimental setup used to acquire image sequences around human heads. The 
camera is mounted to the rotating arm of the tripod with the subject standing close 
to the tripod. Although the camera motion is constrained to be circular, the camera 
orientation and rotation angle are unknown 



The initial guess for the horizon and the image of the rotation axis was picked
by observation, and the angles of rotation were each initialized to 10°. The
sum of the reprojection errors was minimized using the Levenberg-Marquardt
algorithm [?]. The reconstructed 3D head models can be found in Figs. 7 and 8.
The shapes of the ears, lips, nose and eyebrows demonstrate the quality of the
3D models recovered.

Fig. 5. Image sequence (I) used in the experiment, with the silhouettes of the head
tracked using cubic B-spline snakes

Fig. 6. Image sequence (II) used in the experiment, with the silhouettes of the head
tracked using cubic B-spline snakes



5 Conclusions 

In this paper we have presented a simple and practical system for building 3D 
models of human heads from image sequences. No prior model is assumed, and 
in fact the system can be applied to a variety of objects. The only constraint 
on the camera is that it must perform circular motion, though the exact camera 
orientations and positions are unknown. Besides, the camera is not required to 
perform a full rotation and there is no need for using a dense image sequence. 
The silhouettes of the head are the only information used for both motion
estimation and reconstruction, circumventing the lack, instability and occlusion of
landmarks on faces. The silhouettes also provide a natural and compact way of 
parameterizing the head model, concentrating contours around regions of high 
curvature. The experimental results show the accuracy of the acquired model. 





K.-Y.K. Wong, P.R.S. Mendonça, and R. Cipolla




Fig. 7. Different views of the VRML model obtained from the model building process
using the 10 images in Fig. 5
Fig. 8. Different views of the VRML model obtained from the model building process
using the 10 images in Fig. 6